[jira] [Closed] (SPARK-6424) Support user-defined aggregators in AggregateFunction
[ https://issues.apache.org/jira/browse/SPARK-6424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed SPARK-6424. --- Resolution: Duplicate > Support user-defined aggregators in AggregateFunction > - > > Key: SPARK-6424 > URL: https://issues.apache.org/jira/browse/SPARK-6424 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Add a new interface to implement user-defined aggregators in > AggregateExpression. > This enables third parties to easily implement various operations on > DataFrame. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6424) Support user-defined aggregators in AggregateFunction
[ https://issues.apache.org/jira/browse/SPARK-6424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941589#comment-15941589 ] Takeshi Yamamuro commented on SPARK-6424: - The aggregation improvements were already fixed in SPARK-4366, so I'll close this. > Support user-defined aggregators in AggregateFunction > - > > Key: SPARK-6424 > URL: https://issues.apache.org/jira/browse/SPARK-6424 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Add a new interface to implement user-defined aggregators in > AggregateExpression. > This enables third parties to easily implement various operations on > DataFrame. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941577#comment-15941577 ] Hyukjin Kwon commented on SPARK-20090: -- Hm.. don't we have {{names}}? Scala {code} scala> spark.range(1).schema.fieldNames res1: Array[String] = Array(id) {code} Python {code} >>> spark.range(1).schema.names ['id'] {code} > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
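For illustration, the proposed alias could look like the pure-Python sketch below. This is NOT the actual pyspark.sql.types implementation, just a toy model of adding a {{fieldNames}} method that mirrors the Scala/Java API name on top of the existing {{names}} attribute:

```python
# Toy stand-ins for pyspark.sql.types.StructField / StructType, used only
# to illustrate the proposed fieldNames() alias -- not real PySpark code.
class StructField:
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType


class StructType:
    def __init__(self, fields):
        self.fields = fields

    @property
    def names(self):
        # The attribute the Python API already exposes today.
        return [f.name for f in self.fields]

    def fieldNames(self):
        # Proposed method matching the Scala/Java StructType API.
        return self.names


schema = StructType([StructField("id", "long")])
print(schema.names)         # ['id']
print(schema.fieldNames())  # ['id']
```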
[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17137: Assignee: Apache Spark (was: Seth Hendrickson) > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > For sparse coefficients in MLOR, such as when high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps to do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941560#comment-15941560 ] Apache Spark commented on SPARK-17137: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/17426 > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps to do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17137: Assignee: Seth Hendrickson (was: Apache Spark) > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps to do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20091) DagScheduler should allow running concurrent attempts of a stage in case of multiple fetch failure
Sital Kedia created SPARK-20091: --- Summary: DagScheduler should allow running concurrent attempts of a stage in case of multiple fetch failure Key: SPARK-20091 URL: https://issues.apache.org/jira/browse/SPARK-20091 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.0.1 Reporter: Sital Kedia Currently, the DAG scheduler does not allow running concurrent attempts of a stage in case of multiple fetch failures. As a result, when multiple fetch failures are detected, serial re-execution of the map stage delays the job significantly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20090) Add StructType.fieldNames to Python API
Joseph K. Bradley created SPARK-20090: - Summary: Add StructType.fieldNames to Python API Key: SPARK-20090 URL: https://issues.apache.org/jira/browse/SPARK-20090 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 2.1.0 Reporter: Joseph K. Bradley Priority: Trivial The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941410#comment-15941410 ] Apache Spark commented on SPARK-20089: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17424 > Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite > - > > Key: SPARK-20089 > URL: https://issues.apache.org/jira/browse/SPARK-20089 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to > SQLQueryTestSuite for us to review the outputs of each function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20089: Assignee: Apache Spark (was: Xiao Li) > Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite > - > > Key: SPARK-20089 > URL: https://issues.apache.org/jira/browse/SPARK-20089 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to > SQLQueryTestSuite for us to review the outputs of each function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20089: Assignee: Xiao Li (was: Apache Spark) > Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite > - > > Key: SPARK-20089 > URL: https://issues.apache.org/jira/browse/SPARK-20089 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to > SQLQueryTestSuite for us to review the outputs of each function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20089: Summary: Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite (was: Added DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite) > Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite > - > > Key: SPARK-20089 > URL: https://issues.apache.org/jira/browse/SPARK-20089 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to > SQLQueryTestSuite for us to review the outputs of each function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20089) Added DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
Xiao Li created SPARK-20089: --- Summary: Added DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite Key: SPARK-20089 URL: https://issues.apache.org/jira/browse/SPARK-20089 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to SQLQueryTestSuite for us to review the outputs of each function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20088: Assignee: (was: Apache Spark) > Do not create new SparkContext in SparkR createSparkContext > --- > > Key: SPARK-20088 > URL: https://issues.apache.org/jira/browse/SPARK-20088 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > In the implementation of {{createSparkContext}}, we are calling > {code} > new JavaSparkContext() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20088: Assignee: Apache Spark > Do not create new SparkContext in SparkR createSparkContext > --- > > Key: SPARK-20088 > URL: https://issues.apache.org/jira/browse/SPARK-20088 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > > In the implementation of {{createSparkContext}}, we are calling > {code} > new JavaSparkContext() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941394#comment-15941394 ] Apache Spark commented on SPARK-20088: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/17423 > Do not create new SparkContext in SparkR createSparkContext > --- > > Key: SPARK-20088 > URL: https://issues.apache.org/jira/browse/SPARK-20088 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > In the implementation of {{createSparkContext}}, we are calling > {code} > new JavaSparkContext() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
Hossein Falaki created SPARK-20088: -- Summary: Do not create new SparkContext in SparkR createSparkContext Key: SPARK-20088 URL: https://issues.apache.org/jira/browse/SPARK-20088 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki In the implementation of {{createSparkContext}}, we are calling {code} new JavaSparkContext() {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Hu updated SPARK-19408: --- Target Version/s: 2.2.0 (was: 2.3.0) > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, or >=. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to the same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of the same table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
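To make the idea concrete, a range-based filter factor for a predicate like "column-1 < column-2" can be sketched as below. The function name and the fallback value are illustrative assumptions (a common default in cost-based optimizers when the ranges overlap), not Spark's actual estimator:

```python
def estimate_less_than_selectivity(min1, max1, min2, max2):
    """Rough filter factor for the predicate col1 < col2, using only the
    columns' [min, max] ranges. Illustrative heuristic, not Spark's code."""
    if max1 < min2:
        # Ranges are disjoint and col1 is entirely below col2: always true.
        return 1.0
    if min1 >= max2:
        # Ranges are disjoint and col1 is entirely above col2: always false.
        return 0.0
    # Overlapping ranges: without distribution info, fall back to a fixed
    # default guess, as cost-based optimizers commonly do.
    return 1.0 / 3.0


print(estimate_less_than_selectivity(0, 10, 20, 30))  # 1.0
print(estimate_less_than_selectivity(20, 30, 0, 10))  # 0.0
```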
[jira] [Created] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
Charles Lewis created SPARK-20087: - Summary: Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners Key: SPARK-20087 URL: https://issues.apache.org/jira/browse/SPARK-20087 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Charles Lewis When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive accumulators / task metrics for that task, if they were still available. These metrics are not currently sent when tasks are killed intentionally, such as when a speculative retry finishes, and the original is killed (or vice versa). Since we're killing these tasks ourselves, these metrics should almost always exist, and we should treat them the same way as we treat ExceptionFailures. Sending these metrics with the TaskKilled end reason makes aggregation across all tasks in an app more accurate. This data can inform decisions about how to tune the speculation parameters in order to minimize duplicated work, and in general, the total cost of an app should include both successful and failed tasks, if that information exists. PR: https://github.com/apache/spark/pull/17422 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20040: Assignee: Apache Spark > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941367#comment-15941367 ] Apache Spark commented on SPARK-20040: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/17421 > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20040: Assignee: (was: Apache Spark) > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20070) Redact datasource explain output
[ https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941345#comment-15941345 ] Apache Spark commented on SPARK-20070: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17420 > Redact datasource explain output > > > Key: SPARK-20070 > URL: https://issues.apache.org/jira/browse/SPARK-20070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > When calling explain on a datasource, the output can contain sensitive > information. We should provide a way for admins/users to redact such information. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941322#comment-15941322 ] Apache Spark commented on SPARK-19634: -- User 'thunterdb' has created a pull request for this issue: https://github.com/apache/spark/pull/17419 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19634: Assignee: Apache Spark (was: Timothy Hunter) > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Apache Spark > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19634: Assignee: Timothy Hunter (was: Apache Spark) > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19846) Add a flag to disable constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19846. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.2.0 > Add a flag to disable constraint propagation > > > Key: SPARK-19846 > URL: https://issues.apache.org/jira/browse/SPARK-19846 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Constraint propagation can be computation expensive and block the driver > execution for long time. For example, the below benchmark needs 30mins. > Compared with other attempts to modify how constraints propagation works, > this is a much simpler option: add a flag to disable constraint propagation. > {code} > import org.apache.spark.ml.{Pipeline, PipelineStage} > import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, > VectorAssembler} > spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false) > val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, > "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0")) > val indexers = df.columns.tail.map(c => new StringIndexer() > .setInputCol(c) > .setOutputCol(s"${c}_indexed") > .setHandleInvalid("skip")) > val encoders = indexers.map(indexer => new OneHotEncoder() > .setInputCol(indexer.getOutputCol) > .setOutputCol(s"${indexer.getOutputCol}_encoded") > .setDropLast(true)) > val stages: Array[PipelineStage] = indexers ++ encoders > val pipeline = new Pipeline().setStages(stages) > val startTime = System.nanoTime > pipeline.fit(df).transform(df).show > val runningTime = System.nanoTime - startTime > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
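As a toy model of why the benchmark above blows up, note that it creates 40 copies of the same column, and a constraint-propagating optimizer can derive an equality constraint for every pair of equal columns, i.e. quadratically many. The sketch below is an illustration of that growth, not the actual Catalyst constraint-propagation algorithm:

```python
from itertools import combinations


def derived_equality_constraints(columns):
    """Toy model: if all the given columns are known to be equal to the same
    source column, pairwise equality can be derived for every column pair,
    giving N*(N-1)/2 constraints for N columns."""
    return [(a, b) for a, b in combinations(columns, 2)]


# x0 plus 40 copies, mirroring the foldLeft in the benchmark above.
cols = ["x%d" % i for i in range(41)]
print(len(derived_equality_constraints(cols)))  # 820
```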
[jira] [Resolved] (SPARK-20070) Redact datasource explain output
[ https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20070. - Resolution: Fixed Fix Version/s: 2.2.0 > Redact datasource explain output > > > Key: SPARK-20070 > URL: https://issues.apache.org/jira/browse/SPARK-20070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > When calling explain on a datasource, the output can contain sensitive > information. We should provide a way for admins/users to redact such information. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987 ] Jouni H edited comment on SPARK-12216 at 3/24/17 10:25 PM:
---
I was able to reproduce this bug on Windows with the latest Spark version: spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script. Example scenarios:
* ERROR when including --jars AND using saveAsTextFile
* Works when using saveAsTextFile but without any --jars on the command line
* Works when including --jars on the command line but not using saveAsTextFile (commented out)

Example command line:
{{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the command line, it causes the shutdown bug. aws-java-sdk-1.7.4.jar can be downloaded from here: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

bugtest.txt:
{noformat}
one
two
three
{noformat}

sparkbugtest.py:
{noformat}
import sys
import time
from pyspark.sql import SparkSession

def main():
    # Initialize the spark context.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    time.sleep(10)  # just for debug to see how files change in temporary folder
    # at this point there is only this script (sparkbugtest.py) in the temporary folder

    lines.saveAsTextFile(sys.argv[2])

    # at this point there is both sparkbugtest.py and aws-java-sdk-1.7.4.jar in the temporary folder
    time.sleep(10)  # for debug

if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:
* At first there is the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the {{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} gets deleted
* After the spark-submit has ended, the {{aws-java-sdk-1.7.4.jar}} is still in the temporary folder that couldn't be deleted

The temp folder in this example was like: C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\

was (Author: jouni):
I was able to reproduce this bug on Windows with the latest Spark version: spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script. Example scenarios:
* ERROR when including --jars AND using saveAsTextFile
* Works when using saveAsTextFile but without any --jars on the command line
* Works when including --jars on the command line but not using saveAsTextFile (commented out)

Example command line:
{{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the command line, it causes the shutdown bug.
aws-java-sdk-1.7.4.jar can be downloaded from here: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter. Example script:
{noformat}
import sys
from pyspark.sql import SparkSession

def main():
    # Initialize the spark context.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:
* At first there is the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the {{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} gets deleted
* After the spark-submit has ended, the {{aws-java-sdk-1.7.4.jar}} is still in the temporary folder that couldn't be deleted

The temp folder in this example was like: C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\

> Spark failed to delete temp directory
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mandar uapdhye updated SPARK-20086:
---
Description:

original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example code:
{code:borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()
{code}
gives:
| x|AmtPaid|
| 1| 2.0|
| 1| 3.0|
| 1| 1.0|
| 1| -2.0|
| 1| -1.0|

Next, compute the cumulative sum:
{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}
gives:
| x|AmtPaid|AmtPaidCumSum|
| 1| 2.0| 2.0|
| 1| 3.0| 5.0|
| 1| 1.0| 6.0|
| 1| -2.0| 4.0|
| 1| -1.0| 3.0|

Next, compute the cumulative max:
{code}
df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}
gives this error log:
{noformat}
Py4JJavaError: An error occurred while calling o2609.showString.

with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
----> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    316         """
    317         if isinstance(truncate, bool) and truncate:
--> 318             print(self._jdf.showString(n, 20))
    319         else:
    320             print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(
{noformat}

Interestingly enough, if I introduce another change before the second window operation, say inserting a column, then it does not give that error:
{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}
| x|AmtPaid|AmtPaidCumSum|MaxBound|
| 1| 2.0| 2.0| 6.0|
| 1| 3.0| 5.0| 6.0|
| 1| 1.0| 6.0| 6.0|
| 1| -2.0| 4.0| 6.0|
| 1| -1.0| 3.0| 6.0|

{code}
# then apply the second window operation
df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}
| x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
| 1| 2.0| 2.0| 6.0| 2.0|
| 1| 3.0| 5.0| 6.0| 5.0|
| 1| 1.0| 6.0| 6.0| 6.0|
| 1| -2.0| 4.0| 6.0| 6.0|
| 1| -1.0| 3.0| 6.0| 6.0|

I do not understand this behaviour well. So far so good, but then I try another operation and again get a similar error:
{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']
    return output

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll, sparktypes.FloatType())
df = df.withColumn('AmtPaidAdjusted',
                   udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns])))
df.show()
{code}
gives:
{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
----> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc
{noformat}
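As a sanity check on the expected values, the cumulative sum and the cumulative max that the window spec above is meant to produce can be reproduced in plain Python, independent of Spark (a sketch using itertools.accumulate; the list holds the AmtPaid values of the single partition x == 1, in row order):

```python
from itertools import accumulate
import operator

# AmtPaid values for partition x == 1, in row order
amt_paid = [2.0, 3.0, 1.0, -2.0, -1.0]

# rowsBetween(unboundedPreceding, 0) with sum -> running (cumulative) sum
cum_sum = list(accumulate(amt_paid, operator.add))

# the same frame with max -> running max of the cumulative sum
cum_sum_max = list(accumulate(cum_sum, max))

print(cum_sum)      # [2.0, 5.0, 6.0, 4.0, 3.0]
print(cum_sum_max)  # [2.0, 5.0, 6.0, 6.0, 6.0]
```

These match the AmtPaidCumSum and AmtPaidCumSumMax columns in the tables above, so the error is not a matter of wrong expected semantics but of the chained window evaluation itself.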
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987 ]

Jouni H edited comment on SPARK-12216 at 3/24/17 9:58 PM:
--

I was able to reproduce this bug on Windows with the latest Spark version: spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script. Example scenarios:
* ERROR when I include --jars AND use saveAsTextFile
* Works when I use saveAsTextFile but don't pass any --jars on the command line
* Works when I include --jars on the command line but don't use saveAsTextFile (commented out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the command line, it causes the shutdown bug. aws-java-sdk-1.7.4.jar can be downloaded from here: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in bugtest.txt doesn't matter. Example script:
{noformat}
import sys
from pyspark.sql import SparkSession

def main():
    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:
* At first there is only the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the {{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} gets deleted
* After spark-submit has ended, the {{aws-java-sdk-1.7.4.jar}} is still in the temporary folder that couldn't be deleted

The temp folder in this example was like: C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\

was (Author: jouni):
I was able to reproduce this bug on Windows with the latest Spark version: spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script. Example scenarios:
* ERROR when I include --jars AND use saveAsTextFile
* Works when I use saveAsTextFile but don't pass any --jars on the command line
* Works when I include --jars on the command line but don't use saveAsTextFile (commented out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the command line, it causes the shutdown bug. aws-java-sdk-1.7.4.jar can be downloaded from here: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in bugtest.txt doesn't matter. Example script:
{noformat}
import sys
from pyspark.sql import SparkSession

def main():
    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:
* At first there is only the {{sparkbugtest.py}}
* At the end, the {{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} gets deleted
* After spark-submit has ended, the {{aws-java-sdk-1.7.4.jar}} is still in the temporary folder that couldn't be deleted

The temp folder in this example was like: C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\

> Spark failed to delete temp directory
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority:
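The failure mode described above can be illustrated with a small plain-Python sketch (not Spark's actual implementation; function name and retry policy are hypothetical): at shutdown, Spark's ShutdownHookManager essentially tries to remove its temp directory tree, and on Windows that delete fails if any file in the tree, such as the copied --jars artifact, is still held open by another process.

```python
import os
import shutil
import tempfile
import time

def delete_with_retries(path, attempts=3, delay=0.1):
    """Try to remove a directory tree, retrying briefly.

    Sketch only (hypothetical helper): illustrates the kind of cleanup a
    shutdown hook attempts on a userFiles temp directory. On Windows,
    shutil.rmtree raises if a file in the tree is still open elsewhere,
    which matches the leftover aws-java-sdk-1.7.4.jar reported above.
    """
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except OSError:
            time.sleep(delay)  # give the other process a chance to release the file
    return not os.path.exists(path)

# usage: create a throwaway "userFiles" dir with a jar-like file, then clean it up
d = tempfile.mkdtemp(prefix="userFiles-")
with open(os.path.join(d, "aws-java-sdk-1.7.4.jar"), "wb") as f:
    f.write(b"\x00")
print(delete_with_retries(d))  # True when nothing holds the file open
```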
[jira] [Commented] (SPARK-19674) Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-19674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941213#comment-15941213 ]

Apache Spark commented on SPARK-19674:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/17418

> Ignore driver accumulator updates don't belong to the execution when merging
> all accumulator updates
>
> Key: SPARK-19674
> URL: https://issues.apache.org/jira/browse/SPARK-19674
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Carson Wang
> Assignee: Carson Wang
> Priority: Minor
> Fix For: 2.2.0
>
> In SQLListener.getExecutionMetrics, driver accumulator updates that don't
> belong to the execution should be ignored when merging all accumulator
> updates, to prevent a NoSuchElementException.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
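The fix described in SPARK-19674 can be sketched in plain Python (hypothetical names; the real logic lives in Spark's SQLListener.getExecutionMetrics): when merging per-batch accumulator updates, any accumulator id that is not registered for the current execution is skipped instead of being looked up, which is what previously raised NoSuchElementException.

```python
def merge_accumulator_updates(registered_metrics, update_batches):
    """Merge accumulator updates, ignoring ids not registered for the execution.

    Sketch with hypothetical names, not Spark's actual implementation:
    registered_metrics maps accumulator id -> metric name for this execution;
    updates with unknown ids (e.g. driver-side accumulators belonging to a
    different execution) are dropped rather than triggering a lookup failure.
    """
    totals = {}
    for batch in update_batches:
        for acc_id, value in batch:
            if acc_id not in registered_metrics:
                continue  # the SPARK-19674 fix: skip updates from other executions
            totals[acc_id] = totals.get(acc_id, 0) + value
    return {registered_metrics[i]: v for i, v in totals.items()}

# usage: id 99 belongs to a different execution and is ignored
metrics = {1: "numOutputRows", 2: "peakMemory"}
updates = [[(1, 10), (2, 512)], [(1, 5), (99, 7)]]
print(merge_accumulator_updates(metrics, updates))
# {'numOutputRows': 15, 'peakMemory': 512}
```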
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mandar uapdhye updated SPARK-20086: --- Description: original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] I get error when working with pyspark window function. here is some example code: {code:title=borderStyle=solid} import pyspark import pyspark.sql.functions as sf import pyspark.sql.types as sparktypes from pyspark.sql import window sc = pyspark.SparkContext() sqlc = pyspark.SQLContext(sc) rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) df.show() {code} gives: | x|AmtPaid| | 1|2.0| | 1|3.0| | 1|1.0| | 1| -2.0| | 1| -1.0| next, compute cumulative sum {code:title=test.py|borderStyle=solid} win_spec_max = (window.Window .partitionBy(['x']) .rowsBetween(window.Window.unboundedPreceding, 0))) df = df.withColumn('AmtPaidCumSum', sf.sum(sf.col('AmtPaid')).over(win_spec_max)) df.show() {code} gives, | x|AmtPaid|AmtPaidCumSum| | 1|2.0| 2.0| | 1|3.0| 5.0| | 1|1.0| 6.0| | 1| -2.0| 4.0| | 1| -1.0| 3.0| next, compute cumulative max, {code} df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} gives error log {noformat} Py4JJavaError: An error occurred while calling o2609.showString. 
with traceback: Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate) 316 """ 317 if isinstance(truncate, bool) and truncate: --> 318 print(self._jdf.showString(n, 20)) 319 else: 320 print(self._jdf.showString(n, int(truncate))) /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". 
--> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( {noformat} but interestingly enough, if i introduce another change before sencond window operation, say inserting a column then it does not give that error: {code} df = df.withColumn('MaxBound', sf.lit(6.)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound| | 1|2.0| 2.0| 6.0| | 1|3.0| 5.0| 6.0| | 1|1.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| | 1| -1.0| 3.0| 6.0| {code} #then apply the second window operations df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| | 1|2.0| 2.0| 6.0| 2.0| | 1|3.0| 5.0| 6.0| 5.0| | 1|1.0| 6.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| 6.0| | 1| -1.0| 3.0| 6.0| 6.0| I do not understand this behaviour well, so far so good, but then I try another operation then again get similar error: {code} def _udf_compare_cumsum_sll(x): if x['AmtPaidCumSumMax'] >= x['MaxBound']: output = 0 else: output = x['AmtPaid'] udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll, sparktypes.FloatType()) df = df.withColumn('AmtPaidAdjusted', udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns]))) df.show() {code} gives, {noformat} Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n,
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mandar uapdhye updated SPARK-20086: --- Description: original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] I get error when working with pyspark window function. here is some example code: {code:title=borderStyle=solid} import pyspark import pyspark.sql.functions as sf import pyspark.sql.types as sparktypes from pyspark.sql import window sc = pyspark.SparkContext() sqlc = pyspark.SQLContext(sc) rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) df.show() {code} gives: | x|AmtPaid| | 1|2.0| | 1|3.0| | 1|1.0| | 1| -2.0| | 1| -1.0| next, compute cumulative sum {code:title=test.py|borderStyle=solid} win_spec_max = (window.Window .partitionBy(['x']) .rowsBetween(window.Window.unboundedPreceding, 0))) df = df.withColumn('AmtPaidCumSum', sf.sum(sf.col('AmtPaid')).over(win_spec_max)) df.show() {code} gives, | x|AmtPaid|AmtPaidCumSum| | 1|2.0| 2.0| | 1|3.0| 5.0| | 1|1.0| 6.0| | 1| -2.0| 4.0| | 1| -1.0| 3.0| next, compute cumulative max, {code} df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} gives error log {noformat} Py4JJavaError: An error occurred while calling o2609.showString. 
with traceback: Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate) 316 """ 317 if isinstance(truncate, bool) and truncate: --> 318 print(self._jdf.showString(n, 20)) 319 else: 320 print(self._jdf.showString(n, int(truncate))) /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". 
--> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( {noformat} but interestingly enough, if i introduce another change before sencond window operation, say inserting a column then it does not give that error: {code} df = df.withColumn('MaxBound', sf.lit(6.)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound| | 1|2.0| 2.0| 6.0| | 1|3.0| 5.0| 6.0| | 1|1.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| | 1| -1.0| 3.0| 6.0| {code} #then apply the second window operations df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| | 1|2.0| 2.0| 6.0| 2.0| | 1|3.0| 5.0| 6.0| 5.0| | 1|1.0| 6.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| 6.0| | 1| -1.0| 3.0| 6.0| 6.0| I do not understand this behaviour well, so far so good, but then I try another operation then again get similar error: {code} def _udf_compare_cumsum_sll(x): if x['AmtPaidCumSumMax'] >= x['MaxBound']: output = 0 else: output = x['AmtPaid'] udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll, sparktypes.FloatType()) df = df.withColumn('AmtPaidAdjusted', udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns]))) df.show() {code} gives, Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mandar uapdhye updated SPARK-20086: --- Description: original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] I get error when working with pyspark window function. here is some example code: {code:title=borderStyle=solid} import pyspark import pyspark.sql.functions as sf import pyspark.sql.types as sparktypes from pyspark.sql import window sc = pyspark.SparkContext() sqlc = pyspark.SQLContext(sc) rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) df.show() {code} gives: | x|AmtPaid| | 1|2.0| | 1|3.0| | 1|1.0| | 1| -2.0| | 1| -1.0| next, compute cumulative sum {code:title=test.py|borderStyle=solid} win_spec_max = (window.Window .partitionBy(['x']) .rowsBetween(window.Window.unboundedPreceding, 0))) df = df.withColumn('AmtPaidCumSum', sf.sum(sf.col('AmtPaid')).over(win_spec_max)) df.show() {code} gives, | x|AmtPaid|AmtPaidCumSum| | 1|2.0| 2.0| | 1|3.0| 5.0| | 1|1.0| 6.0| | 1| -2.0| 4.0| | 1| -1.0| 3.0| next, compute cumulative max, {code} df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} gives error log Py4JJavaError: An error occurred while calling o2609.showString. 
with traceback: Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate) 316 """ 317 if isinstance(truncate, bool) and truncate: --> 318 print(self._jdf.showString(n, 20)) 319 else: 320 print(self._jdf.showString(n, int(truncate))) /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". 
--> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( but interestingly enough, if i introduce another change before sencond window operation, say inserting a column then it does not give that error: {code} df = df.withColumn('MaxBound', sf.lit(6.)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound| | 1|2.0| 2.0| 6.0| | 1|3.0| 5.0| 6.0| | 1|1.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| | 1| -1.0| 3.0| 6.0| {code} #then apply the second window operations df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| | 1|2.0| 2.0| 6.0| 2.0| | 1|3.0| 5.0| 6.0| 5.0| | 1|1.0| 6.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| 6.0| | 1| -1.0| 3.0| 6.0| 6.0| I do not understand this behaviour well, so far so good, but then I try another operation then again get similar error: {code} def _udf_compare_cumsum_sll(x): if x['AmtPaidCumSumMax'] >= x['MaxBound']: output = 0 else: output = x['AmtPaid'] udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll, sparktypes.FloatType()) df = df.withColumn('AmtPaidAdjusted', udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns]))) df.show() {code} gives, Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate) 316 """
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mandar uapdhye updated SPARK-20086: --- Description: original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] I get error when working with pyspark window function. here is some example code: {code:title=borderStyle=solid} import pyspark import pyspark.sql.functions as sf import pyspark.sql.types as sparktypes from pyspark.sql import window sc = pyspark.SparkContext() sqlc = pyspark.SQLContext(sc) rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) df.show() {code} gives: | x|AmtPaid| | 1|2.0| | 1|3.0| | 1|1.0| | 1| -2.0| | 1| -1.0| next, compute cumulative sum {code:title=test.py|borderStyle=solid} win_spec_max = (window.Window .partitionBy(['x']) .rowsBetween(window.Window.unboundedPreceding, 0))) df = df.withColumn('AmtPaidCumSum', sf.sum(sf.col('AmtPaid')).over(win_spec_max)) df.show() {code} gives, +---+---+-+ | x|AmtPaid|AmtPaidCumSum| +---+---+-+ | 1|2.0| 2.0| | 1|3.0| 5.0| | 1|1.0| 6.0| | 1| -2.0| 4.0| | 1| -1.0| 3.0| +---+---+-+ next, compute cumulative max, {code: borderStyle=solid} df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} gives error log Py4JJavaError: An error occurred while calling o2609.showString. 
with traceback: Py4JJavaErrorTraceback (most recent call last) in () > 1 df.show() /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate) 316 """ 317 if isinstance(truncate, bool) and truncate: --> 318 print(self._jdf.showString(n, 20)) 319 else: 320 print(self._jdf.showString(n, int(truncate))) /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". 
--> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( but interestingly enough, if i introduce another change before sencond window operation, say inserting a column then it does not give that error: {code} df = df.withColumn('MaxBound', sf.lit(6.)) df.show() {code} +---+---+-++ | x|AmtPaid|AmtPaidCumSum|MaxBound| +---+---+-++ | 1|2.0| 2.0| 6.0| | 1|3.0| 5.0| 6.0| | 1|1.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| | 1| -1.0| 3.0| 6.0| +---+---+-++ {code} #then apply the second window operations df = df.withColumn('AmtPaidCumSumMax', sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) df.show() {code} +---+---+-+++ | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| +---+---+-+++ | 1|2.0| 2.0| 6.0| 2.0| | 1|3.0| 5.0| 6.0| 5.0| | 1|1.0| 6.0| 6.0| 6.0| | 1| -2.0| 4.0| 6.0| 6.0| | 1| -1.0| 3.0| 6.0| 6.0| +---+---+-+++ I do not understand this behaviour well, so far so good, but then I try another operation then again get similar error: {code} def _udf_compare_cumsum_sll(x): if x['AmtPaidCumSumMax'] >= x['MaxBound']: output = 0 else: output = x['AmtPaid']
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mandar uapdhye updated SPARK-20086:
-----------------------------------
    Description: 
original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example code:

{code:title=test.py|borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()
{code}

gives:

{noformat}
+---+-------+
|  x|AmtPaid|
+---+-------+
|  1|    2.0|
|  1|    3.0|
|  1|    1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+-------+
{noformat}

Next, compute the cumulative sum:

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives:

{noformat}
+---+-------+-------------+
|  x|AmtPaid|AmtPaidCumSum|
+---+-------+-------------+
|  1|    2.0|          2.0|
|  1|    3.0|          5.0|
|  1|    1.0|          6.0|
|  1|   -2.0|          4.0|
|  1|   -1.0|          3.0|
+---+-------+-------------+
{noformat}

Next, compute the cumulative max:

{code:borderStyle=solid}
df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

gives this error log:

{noformat}
Py4JJavaError: An error occurred while calling o2609.showString.
{noformat}

with traceback:

{noformat}
Py4JJavaErrorTraceback (most recent call last)
in ()
----> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    316         """
    317         if isinstance(truncate, bool) and truncate:
--> 318             print(self._jdf.showString(n, 20))
    319         else:
    320             print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(
{noformat}

But interestingly enough, if I introduce another change before the second window operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}

{noformat}
+---+-------+-------------+--------+
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+-------+-------------+--------+
|  1|    2.0|          2.0|     6.0|
|  1|    3.0|          5.0|     6.0|
|  1|    1.0|          6.0|     6.0|
|  1|   -2.0|          4.0|     6.0|
|  1|   -1.0|          3.0|     6.0|
+---+-------+-------------+--------+
{noformat}

Then apply the second window operation:

{code}
df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

{noformat}
+---+-------+-------------+--------+----------------+
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+-------+-------------+--------+----------------+
|  1|    2.0|          2.0|     6.0|             2.0|
|  1|    3.0|          5.0|     6.0|             5.0|
|  1|    1.0|          6.0|     6.0|             6.0|
|  1|   -2.0|          4.0|     6.0|             6.0|
|  1|   -1.0|          3.0|     6.0|             6.0|
+---+-------+-------------+--------+----------------+
{noformat}

So far so good, but I do not understand this behaviour well. Then I try another operation and again get a similar error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll,
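For reference, the values in the tables above can be reproduced without Spark at all. A minimal pure-Python sketch of the two window aggregates the description builds (a running sum of AmtPaid per partition, then a running max of that sum; the function name is illustrative):

```python
# Pure-Python sketch of what the window spec computes:
# rowsBetween(Window.unboundedPreceding, 0) partitioned by 'x' means each
# row sees an aggregate over all earlier rows (and itself) in its partition.
def cum_sum_and_max(rows):
    """rows: (x, amt) tuples in frame order -> (x, amt, cum_sum, cum_max_of_sum)."""
    sums = {}   # partition key -> running sum of AmtPaid
    maxes = {}  # partition key -> running max of the cumulative sum
    out = []
    for x, amt in rows:
        s = sums.get(x, 0.0) + amt
        sums[x] = s
        m = max(maxes.get(x, float("-inf")), s)
        maxes[x] = m
        out.append((x, amt, s, m))
    return out

rows = [(1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1, -1.0)]
for r in cum_sum_and_max(rows):
    print(r)
```

Running this prints exactly the (AmtPaid, AmtPaidCumSum, AmtPaidCumSumMax) rows shown in the tables, which confirms the reported failure is in plan execution, not in the arithmetic being asked for.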
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987 ] Jouni H edited comment on SPARK-12216 at 3/24/17 9:37 PM:
----------------------------------------------------------

I was able to reproduce this bug on Windows with the latest Spark version, spark-2.1.0-bin-hadoop2.7. The bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script.

Example scenarios:
* ERROR when I include --jars AND use saveAsTextFile
* Works when I use saveAsTextFile but don't pass any --jars on the command line
* Works when I include --jars on the command line but don't use saveAsTextFile (commented out)

Example command line:
{{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but including it on the command line triggers the shutdown bug. aws-java-sdk-1.7.4.jar can be downloaded from https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar. The input in bugtest.txt doesn't matter.

Example script:
{noformat}
import sys
from pyspark.sql import SparkSession

def main():
    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

After the error is thrown and spark-submit has ended, the folder that couldn't be deleted still contains the .jar file, for example {{C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\aws-java-sdk-1.7.4.jar}}

was (Author: jouni): (same reproduction steps, command line, and script as above, without the final paragraph about the leftover jar file)

> Spark failed to delete temp directory
> -------------------------------------
>
>                 Key: SPARK-12216
>                 URL: https://issues.apache.org/jira/browse/SPARK-12216
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>         Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>            Reporter: stefan
>            Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark
> temp dir:
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete:
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at
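The leftover jar is the classic Windows failure mode for recursive deletes: a file still locked or marked read-only makes the whole tree deletion fail. As an illustration only (this is not Spark's actual ShutdownHookManager code, and it does not fix a file that another process still holds open), the usual `shutil.rmtree` workaround looks like this:

```python
# Sketch: delete a temp tree, clearing the read-only attribute and retrying
# when rmtree hits a file it cannot remove.
import os
import shutil
import stat
import tempfile

def _force_remove(func, path, exc_info):
    # Called by rmtree on failure: make the path writable, retry the
    # failing operation (os.unlink / os.rmdir).
    os.chmod(path, stat.S_IWRITE)
    func(path)

def delete_tree(path):
    shutil.rmtree(path, onerror=_force_remove)

# Usage: a throwaway dir with a read-only file inside, like the leftover jar.
d = tempfile.mkdtemp(prefix="spark-like-")
f = os.path.join(d, "stub.jar")
open(f, "w").close()
os.chmod(f, stat.S_IREAD)
delete_tree(d)
print(os.path.exists(d))  # False
```

This removes read-only leftovers, but if the JVM still has the jar mapped (the likely cause here), no delete succeeds until the handle is released.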
[jira] [Created] (SPARK-20086) issue with pyspark 2.1.0 window function
mandar uapdhye created SPARK-20086:
--------------------------------------
             Summary: issue with pyspark 2.1.0 window function
                 Key: SPARK-20086
                 URL: https://issues.apache.org/jira/browse/SPARK-20086
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: mandar uapdhye

original post at [stackoverflow | http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example code:

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

gives:

+---+-------+
|  x|AmtPaid|
+---+-------+
|  1|    2.0|
|  1|    3.0|
|  1|    1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+-------+

Next, compute the cumulative sum:

win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()

gives:

+---+-------+-------------+
|  x|AmtPaid|AmtPaidCumSum|
+---+-------+-------------+
|  1|    2.0|          2.0|
|  1|    3.0|          5.0|
|  1|    1.0|          6.0|
|  1|   -2.0|          4.0|
|  1|   -1.0|          3.0|
+---+-------+-------------+

Next, compute the cumulative max:

df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()

gives this error log:

Py4JJavaError: An error occurred while calling o2609.showString.

with traceback:

Py4JJavaErrorTraceback (most recent call last)
in ()
----> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    316         """
    317         if isinstance(truncate, bool) and truncate:
--> 318             print(self._jdf.showString(n, 20))
    319         else:
    320             print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

But interestingly enough, if I introduce another change before the second window operation, say inserting a column, then it does not give that error:

df = df.withColumn('MaxBound', sf.lit(6.))
df.show()

+---+-------+-------------+--------+
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+-------+-------------+--------+
|  1|    2.0|          2.0|     6.0|
|  1|    3.0|          5.0|     6.0|
|  1|    1.0|          6.0|     6.0|
|  1|   -2.0|          4.0|     6.0|
|  1|   -1.0|          3.0|     6.0|
+---+-------+-------------+--------+

# then apply the second window operation
df = df.withColumn('AmtPaidCumSumMax',
                   sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()

+---+-------+-------------+--------+----------------+
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+-------+-------------+--------+----------------+
|  1|    2.0|          2.0|     6.0|             2.0|
|  1|    3.0|          5.0|     6.0|             5.0|
|  1|    1.0|          6.0|     6.0|             6.0|
|  1|   -2.0|          4.0|     6.0|             6.0|
|  1|   -1.0|          3.0|     6.0|             6.0|
+---+-------+-------------+--------+----------------+

So far so good, but I do not understand this behaviour well. Then I try another operation and again get a similar error:

def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
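The description is cut off at this point, but a later revision of the same ticket completes the UDF: it returns 0 once the running maximum has reached the cap, and AmtPaid otherwise. Its intent can be sketched in plain Python, with row dicts standing in for Spark Rows:

```python
# Sketch of the UDF's logic (completed from a later revision of the
# description): stop paying once the running max hits MaxBound.
def compare_cumsum_sll(row):
    if row['AmtPaidCumSumMax'] >= row['MaxBound']:
        return 0.0
    return row['AmtPaid']

rows = [
    {'AmtPaid': 2.0, 'AmtPaidCumSum': 2.0, 'MaxBound': 6.0, 'AmtPaidCumSumMax': 2.0},
    {'AmtPaid': 3.0, 'AmtPaidCumSum': 5.0, 'MaxBound': 6.0, 'AmtPaidCumSumMax': 5.0},
    {'AmtPaid': 1.0, 'AmtPaidCumSum': 6.0, 'MaxBound': 6.0, 'AmtPaidCumSumMax': 6.0},
    {'AmtPaid': -2.0, 'AmtPaidCumSum': 4.0, 'MaxBound': 6.0, 'AmtPaidCumSumMax': 6.0},
]
print([compare_cumsum_sll(r) for r in rows])  # [2.0, 3.0, 0.0, 0.0]
```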
[jira] [Assigned] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20075: Assignee: Sean Owen (was: Apache Spark) > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
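The extended coordinate forms the ticket describes can be handled by splitting on colons and dispatching on the field count: `group:artifact:version`, `group:artifact:packaging:version`, and `group:artifact:packaging:classifier:version`. A hypothetical parser sketch (names are illustrative; this is not the actual spark-submit implementation):

```python
# Sketch: parse a 3-, 4-, or 5-field Maven coordinate. Packaging defaults
# to "jar" and classifier to None when omitted, per Maven convention.
def parse_coordinate(coord):
    parts = coord.split(":")
    if len(parts) == 3:
        group, artifact, version = parts
        packaging, classifier = "jar", None
    elif len(parts) == 4:
        group, artifact, packaging, version = parts
        classifier = None
    elif len(parts) == 5:
        group, artifact, packaging, classifier, version = parts
    else:
        raise ValueError("expected 3-5 colon-separated fields: %r" % coord)
    return {"group": group, "artifact": artifact, "packaging": packaging,
            "classifier": classifier, "version": version}

print(parse_coordinate("org.apache.spark:spark-core_2.11:2.1.0")["packaging"])  # jar
print(parse_coordinate("com.example:lib:jar:tests:1.0")["classifier"])          # tests
```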
[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941155#comment-15941155 ] Apache Spark commented on SPARK-20075: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/17416 > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20075: Assignee: Apache Spark (was: Sean Owen) > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale
[ https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941138#comment-15941138 ] Leif Warner commented on SPARK-18484:
-------------------------------------

In the example on the ticket, the scale of the input BigDecimals is 2. Spark discards that information when it creates the tables and sets the scale to 18 in the schema. When it converts from table form back to BigDecimal, it respects the scale specified in the schema in the created BigDecimals.

> case class datasets - ability to specify decimal precision and scale
> --------------------------------------------------------------------
>
>                 Key: SPARK-18484
>                 URL: https://issues.apache.org/jira/browse/SPARK-18484
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Damian Momot
>
> Currently when using decimal type (BigDecimal in Scala case class) there's no
> way to enforce precision and scale. This is quite critical when saving data,
> regarding space usage and compatibility with external systems (for example a
> Hive table), because Spark saves data as Decimal(38,18)
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> Workaround is to convert the dataset to a dataframe before saving and
> manually cast to a specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
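The semantics of the `cast(DecimalType(10,2))` workaround can be illustrated with the stdlib `decimal` module (a sketch of the behavior, not Spark's code): rescale the value to 2 fractional digits and reject values whose total digit count exceeds the precision.

```python
# Sketch: emulate a decimal(precision, scale) cast with stdlib Decimal.
from decimal import Decimal, ROUND_HALF_UP

def cast_decimal(value, precision=10, scale=2):
    q = Decimal(1).scaleb(-scale)  # e.g. Decimal('0.01') for scale=2
    out = Decimal(value).quantize(q, rounding=ROUND_HALF_UP)
    # Enforce precision: the total number of digits must fit.
    if len(out.as_tuple().digits) > precision:
        raise ValueError("value does not fit decimal(%d,%d)" % (precision, scale))
    return out

print(cast_decimal("22.50"))   # 22.50
print(cast_decimal("500.66"))  # 500.66
```

Spark's default decimal(38,18) instead pads these to 18 fractional digits, which is the space and compatibility problem the ticket describes.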
[jira] [Commented] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941117#comment-15941117 ] Apache Spark commented on SPARK-19408: -- User 'ron8hu' has created a pull request for this issue: https://github.com/apache/spark/pull/17415 > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, or >=. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
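The idea of a filter factor for "column-1 (op) column-2" on one table can be sketched from column range statistics. The following is illustrative only, not the formula from the actual PR: it assumes both columns are independent and uniformly distributed over their [min, max] ranges and approximates P(col1 < col2) on a midpoint grid.

```python
# Sketch: estimate the selectivity of "col1 < col2" from range statistics,
# assuming independent uniform distributions.
def less_than_selectivity(min1, max1, min2, max2, steps=200):
    if max1 <= min2:
        return 1.0  # col1's range lies entirely below col2's
    if min1 >= max2:
        return 0.0  # col1's range lies entirely above col2's
    # Overlapping ranges: approximate P(X < Y) on a midpoint grid.
    hits = 0
    for i in range(steps):
        x = min1 + (max1 - min1) * (i + 0.5) / steps
        for j in range(steps):
            y = min2 + (max2 - min2) * (j + 0.5) / steps
            if x < y:
                hits += 1
    return hits / (steps * steps)

print(less_than_selectivity(0, 10, 20, 30))  # 1.0
print(abs(less_than_selectivity(0, 10, 0, 10) - 0.5) < 0.01)  # True
```

The non-overlapping cases give the 0/1 shortcuts; identical ranges give roughly 0.5, matching the intuition that half the row pairs satisfy the predicate.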
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941116#comment-15941116 ] DB Tsai commented on SPARK-17137: - [~sethah] We can do the performance test during the QA time. Thanks. > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps to do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19408: Assignee: Apache Spark > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu >Assignee: Apache Spark > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, or >=. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-17137: --- Assignee: Seth Hendrickson > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps to do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19408: Assignee: (was: Apache Spark) > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, or >=. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941114#comment-15941114 ] Seth Hendrickson commented on SPARK-17137: -- I can make a PR for using this inside the MLOR code, but I probably won't have time to do performance tests within the next couple of days (since code freeze has already passed). [~dbtsai] Do you think we need to do performance tests before this patch goes in? > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as under high L1 regularization, it may > be more efficient to store the coefficients in a compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941106#comment-15941106 ] DB Tsai commented on SPARK-17137: - Ping! SPARK-17471 is merged. Is anyone interested in getting this in for the 2.2.0 release? > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as under high L1 regularization, it may > be more efficient to store the coefficients in a compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941104#comment-15941104 ] Sean Owen commented on SPARK-19476: --- I don't think in general you're expected to be able to do this safely. Why not simply do this asynchronously, or with more partitions? > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
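The failure mode described above can be reproduced outside Spark. Below is a minimal Python analogy, an illustrative sketch rather than Spark code: iterator state kept in thread-local storage is invisible when the iterator is advanced from a different thread, and fully materializing the iterator on the owning thread first avoids the problem, mirroring the workaround discussed in this ticket.

```python
# Minimal analogy of the hazard (not Spark code): an iterator whose state
# lives in thread-local storage yields nothing when advanced from a thread
# other than the one that created it.
import threading

_state = threading.local()

def make_iterator(data):
    _state.buffer = list(data)   # state visible only to the creating thread
    def it():
        while getattr(_state, "buffer", None):
            yield _state.buffer.pop(0)
    return it()

it = make_iterator([1, 2, 3])

result = []
t = threading.Thread(target=lambda: result.extend(it))
t.start(); t.join()
print(result)  # → [] : the worker thread sees no thread-local buffer

# Workaround mirroring the JIRA discussion: fully materialize the iterator
# on the owning thread first, then hand the plain collection to workers.
safe = list(make_iterator([1, 2, 3]))
result2 = []
t2 = threading.Thread(target=lambda: result2.extend(safe))
t2.start(); t2.join()
print(result2)  # → [1, 2, 3]
```

In Spark terms, `iterator.toVector` on the task's own thread before spawning futures plays the role of `list(...)` here.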
[jira] [Resolved] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-17471. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15628 [https://github.com/apache/spark/pull/15628] > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > Fix For: 2.2.0 > > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
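The "select the lower storage representation" rule described in this issue can be sketched as follows. The byte costs below are illustrative assumptions, not Spark's actual memory accounting: pick dense or CSC-style sparse storage, whichever needs fewer bytes.

```python
# Illustrative sketch of the compressed-format selection rule described
# above, with assumed byte costs (Spark's real accounting differs): pick
# the representation, dense or CSC-style sparse, that needs less storage.

def dense_size(num_rows, num_cols):
    return 8 * num_rows * num_cols          # one 8-byte double per entry

def sparse_size(num_cols, nnz):
    # values + 4-byte row indices + 4-byte column pointers (CSC layout)
    return 12 * nnz + 4 * (num_cols + 1)

def choose_format(num_rows, num_cols, nnz):
    if sparse_size(num_cols, nnz) < dense_size(num_rows, num_cols):
        return "sparse"
    return "dense"

print(choose_format(1000, 100, 50))   # mostly zeros → "sparse"
print(choose_format(10, 10, 100))     # fully dense → "dense"
```

The row-major versus column-major choice mentioned in the description would add a second comparison of the two sparse layouts, which this sketch omits.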
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941079#comment-15941079 ] Lucy Yu commented on SPARK-19476: - I believe we've seen a similar issue raised here: https://github.com/memsql/memsql-spark-connector/issues/31 It seems that {code} sqlDf.foreachPartition(partition => { new Thread(new Runnable { override def run(): Unit = { for (row <- partition) { // do nothing here, just force the partition to be fully iterated over } } }).start() }) {code} results in {code} java.lang.NullPointerException at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:287) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:86) at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161) at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(UnknownSource) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextPartition(WindowExec.scala:361) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:391) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:290) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(UnknownSource) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at com.cainc.data.etl.lake.TestMartOutput$$anonfun$main$3$$anon$1.run(TestMartOutput.scala:42) at java.lang.Thread.run(Thread.java:745) {code} > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. 
Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - >
[jira] [Assigned] (SPARK-20085) Configurable mesos labels for executors
[ https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20085: Assignee: (was: Apache Spark) > Configurable mesos labels for executors > --- > > Key: SPARK-20085 > URL: https://issues.apache.org/jira/browse/SPARK-20085 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kalvin Chau >Priority: Minor > > Add spark.mesos.task.labels configuration option to add mesos key:value > labels to the executor. > "k1:v1,k2:v2" as the format, colons separating key-value and commas to list > out more than one. > See discussion on label at https://github.com/apache/spark/pull/17404 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20085) Configurable mesos labels for executors
[ https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20085: Assignee: Apache Spark > Configurable mesos labels for executors > --- > > Key: SPARK-20085 > URL: https://issues.apache.org/jira/browse/SPARK-20085 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kalvin Chau >Assignee: Apache Spark >Priority: Minor > > Add spark.mesos.task.labels configuration option to add mesos key:value > labels to the executor. > "k1:v1,k2:v2" as the format, colons separating key-value and commas to list > out more than one. > See discussion on label at https://github.com/apache/spark/pull/17404 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20085) Configurable mesos labels for executors
[ https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941065#comment-15941065 ] Apache Spark commented on SPARK-20085: -- User 'kalvinnchau' has created a pull request for this issue: https://github.com/apache/spark/pull/17413 > Configurable mesos labels for executors > --- > > Key: SPARK-20085 > URL: https://issues.apache.org/jira/browse/SPARK-20085 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kalvin Chau >Priority: Minor > > Add spark.mesos.task.labels configuration option to add mesos key:value > labels to the executor. > "k1:v1,k2:v2" as the format, colons separating key-value and commas to list > out more than one. > See discussion on label at https://github.com/apache/spark/pull/17404 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20085) Configurable mesos labels for executors
Kalvin Chau created SPARK-20085: --- Summary: Configurable mesos labels for executors Key: SPARK-20085 URL: https://issues.apache.org/jira/browse/SPARK-20085 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.1.0 Reporter: Kalvin Chau Priority: Minor Add a spark.mesos.task.labels configuration option to attach Mesos key:value labels to executors. The format is "k1:v1,k2:v2": a colon separates each key from its value, and commas separate multiple labels. See the discussion on labels at https://github.com/apache/spark/pull/17404 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
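A parser for the proposed "k1:v1,k2:v2" format might look like the following. `parse_labels` is a hypothetical helper for illustration, not the actual Spark implementation:

```python
# Hypothetical parser for the "k1:v1,k2:v2" label format proposed above
# (not the actual Spark implementation): commas separate labels, and a
# colon separates each key from its value.

def parse_labels(spec):
    labels = {}
    if not spec:
        return labels
    for pair in spec.split(","):
        key, sep, value = pair.partition(":")
        if not sep or not key:
            raise ValueError(f"Malformed label {pair!r}, expected key:value")
        labels[key.strip()] = value.strip()
    return labels

print(parse_labels("k1:v1,k2:v2"))  # → {'k1': 'v1', 'k2': 'v2'}
```

A real implementation would also need a policy for values that themselves contain colons or commas (escaping or quoting), which the issue's format string leaves unspecified.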
[jira] [Resolved] (SPARK-19911) Add builder interface for Kinesis DStreams
[ https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-19911. - Resolution: Fixed Assignee: Adam Budde Fix Version/s: 2.2.0 Target Version/s: 2.2.0 Resolved with https://github.com/apache/spark/pull/17250 > Add builder interface for Kinesis DStreams > -- > > Key: SPARK-19911 > URL: https://issues.apache.org/jira/browse/SPARK-19911 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Adam Budde >Priority: Minor > Fix For: 2.2.0 > > > The ```KinesisUtils.createStream()``` interface for creating Kinesis-based > DStreams is quite brittle and requires adding a combinatorial number of > overrides whenever another optional configuration parameter is added. This > makes incorporating a lot of additional features supported by the Kinesis > Client Library such as per-service authorization unfeasible. This interface > should be replaced by a builder pattern class > (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater > extensibility. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files
Ryan Blue created SPARK-20084: - Summary: Remove internal.metrics.updatedBlockStatuses accumulator from history files Key: SPARK-20084 URL: https://issues.apache.org/jira/browse/SPARK-20084 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 2.1.0 Reporter: Ryan Blue History files for large jobs can be hundreds of GB. These history files take too much space and create a backlog on the history server. Most of the size is from Accumulables in SparkListenerTaskEnd. The largest accumulable is internal.metrics.updatedBlockStatuses, which has a small update (the blocks that were changed) but a huge value (all known blocks). Nothing currently uses the accumulator value or update, so it is safe to remove it. Information for any block updated during a task is also recorded under Task Metrics / Updated Blocks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files
[ https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20084: Assignee: Apache Spark > Remove internal.metrics.updatedBlockStatuses accumulator from history files > --- > > Key: SPARK-20084 > URL: https://issues.apache.org/jira/browse/SPARK-20084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > History files for large jobs can be hundreds of GB. These history files take > too much space and create a backlog on the history server. > Most of the size is from Accumulables in SparkListenerTaskEnd. The largest > accumulable is internal.metrics.updatedBlockStatuses, which has a small > update (the blocks that were changed) but a huge value (all known blocks). > Nothing currently uses the accumulator value or update, so it is safe to > remove it. Information for any block updated during a task is also recorded > under Task Metrics / Updated Blocks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files
[ https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941012#comment-15941012 ] Apache Spark commented on SPARK-20084: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/17412 > Remove internal.metrics.updatedBlockStatuses accumulator from history files > --- > > Key: SPARK-20084 > URL: https://issues.apache.org/jira/browse/SPARK-20084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Ryan Blue > > History files for large jobs can be hundreds of GB. These history files take > too much space and create a backlog on the history server. > Most of the size is from Accumulables in SparkListenerTaskEnd. The largest > accumulable is internal.metrics.updatedBlockStatuses, which has a small > update (the blocks that were changed) but a huge value (all known blocks). > Nothing currently uses the accumulator value or update, so it is safe to > remove it. Information for any block updated during a task is also recorded > under Task Metrics / Updated Blocks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files
[ https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20084: Assignee: (was: Apache Spark) > Remove internal.metrics.updatedBlockStatuses accumulator from history files > --- > > Key: SPARK-20084 > URL: https://issues.apache.org/jira/browse/SPARK-20084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Ryan Blue > > History files for large jobs can be hundreds of GB. These history files take > too much space and create a backlog on the history server. > Most of the size is from Accumulables in SparkListenerTaskEnd. The largest > accumulable is internal.metrics.updatedBlockStatuses, which has a small > update (the blocks that were changed) but a huge value (all known blocks). > Nothing currently uses the accumulator value or update, so it is safe to > remove it. Information for any block updated during a task is also recorded > under Task Metrics / Updated Blocks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987 ] Jouni H commented on SPARK-12216: - I was able to reproduce this bug on Windows with the latest Spark version: spark-2.1.0-bin-hadoop2.7. The bug happens for me when I include --jars for spark-submit AND use saveAsTextFile in the script. Example scenarios: * ERROR when including --jars AND using saveAsTextFile * Works when using saveAsTextFile without any --jars on the command line * Works when including --jars on the command line but not using saveAsTextFile (commented out) Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar sparkbugtest.py bugtest.txt ./output/test1/}} The script here doesn't need the --jars file, but including it on the command line causes the shutdown bug. The input in bugtest.txt doesn't matter. Example script: {noformat}
import sys
from pyspark.sql import SparkSession

def main():
    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("SparkParseLogTest")\
        .getOrCreate()
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()
{noformat} I also use winutils.exe as mentioned here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. 
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message 
was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
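A common mitigation for this class of failure can be sketched as follows; this is an illustrative pattern, not Spark's `ShutdownHookManager` code. On Windows, recursive deletes fail while another process (or a not-yet-released handle in the same JVM) still holds a file open, so retrying with a short backoff often succeeds where a single attempt does not:

```python
# Sketch of a retrying recursive delete (not Spark's implementation): on
# Windows, deletion fails while another process still holds a handle, so
# retry a few times with a short backoff before giving up.
import os
import shutil
import tempfile
import time

def delete_recursively(path, attempts=3, delay=0.1):
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except OSError:
            if attempt == attempts - 1:
                raise               # exhausted retries: surface the error
            time.sleep(delay * (attempt + 1))   # linear backoff
    return False

tmp = tempfile.mkdtemp(prefix="spark-")
open(os.path.join(tmp, "data.txt"), "w").close()
print(delete_recursively(tmp))      # → True
print(os.path.exists(tmp))          # → False
```

As the comment above notes, the more reliable fix is to ensure nothing in the application still holds the files open at shutdown; retries only paper over transient handle churn.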
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940960#comment-15940960 ] yuhao yang commented on SPARK-20082: Yes, that's one of the things we should improve for LDA. If you're interested in working on the issue, could you please share a rough design first, given the complexity of both the EM and Online optimizers and their models? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest adding an initialModel as a starting point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940897#comment-15940897 ] gagan taneja commented on SPARK-19705: -- Sandy, would you be able to help with this bug? I saw you filed an earlier bug for HDFS cache support in RDDs, and this is a relatively minor change. We have been running this code in production and have achieved a major performance boost. Below is the reference to the earlier bug you filed: https://issues.apache.org/jira/browse/SPARK-1767 > Preferred location supporting HDFS Cache for FileScanRDD > > > Key: SPARK-19705 > URL: https://issues.apache.org/jira/browse/SPARK-19705 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: gagan taneja > > Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating > preferredLocations, FileScanRDD does not take the HDFS cache into account while > calculating preferredLocations. > The enhancement can be easily implemented for large files where a FilePartition > only contains a single HDFS file. > The enhancement will also result in a significant performance improvement for > cached HDFS partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940796#comment-15940796 ] Ian Maloney commented on SPARK-12664: - Does anyone know which versions this will make it into? Also, when it might make it into a release? > Expose raw prediction scores in MultilayerPerceptronClassificationModel > --- > > Key: SPARK-12664 > URL: https://issues.apache.org/jira/browse/SPARK-12664 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Robert Dodier >Assignee: Weichen Xu > > In > org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, > there isn't any way to get raw prediction scores; only an integer output > (from 0 to #classes - 1) is available via the `predict` method. > `mplModel.predict` is called within the class to get the raw score, but > `mlpModel` is private so that isn't available to outside callers. > The raw score is useful when the user wants to interpret the classifier > output as a probability. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
Seth Hendrickson created SPARK-20083: Summary: Change matrix toArray to not create a new array when matrix is already column major Key: SPARK-20083 URL: https://issues.apache.org/jira/browse/SPARK-20083 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Seth Hendrickson Priority: Minor {{toArray}} always creates a new array in column major format, even when the resulting array is the same as the backing values. We should change this to just return a reference to the values array when it is already column major. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
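The proposed change can be sketched like this, using hypothetical Python names rather than the Scala `Matrix` API: return a reference to the backing array when the layout is already column major, and copy only in the transposed (row-major) case.

```python
# Sketch of the proposed optimization (hypothetical names, not Spark's
# Scala code): a dense matrix stores a flat `values` array; when it is
# already column major, `to_array` can return the backing array itself
# instead of allocating a copy.

class DenseMatrix:
    def __init__(self, num_rows, num_cols, values, is_transposed=False):
        # is_transposed=True means `values` is laid out row major
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = values
        self.is_transposed = is_transposed

    def to_array(self):
        if not self.is_transposed:
            return self.values          # already column major: no copy
        # row major: build a column-major copy
        return [
            self.values[i * self.num_cols + j]
            for j in range(self.num_cols)
            for i in range(self.num_rows)
        ]

col_major = DenseMatrix(2, 2, [1, 3, 2, 4])
assert col_major.to_array() is col_major.values   # same object, no copy

row_major = DenseMatrix(2, 2, [1, 2, 3, 4], is_transposed=True)
print(row_major.to_array())  # → [1, 3, 2, 4]
```

The trade-off noted in reviews of this kind of change is that returning the backing array aliases internal state, so callers must not mutate the result; a copying path stays necessary for the transposed layout.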
[jira] [Updated] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20082: -- Component/s: (was: MLlib) ML > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20082: -- Issue Type: New Feature (was: Wish) > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940672#comment-15940672 ] Stéphane Collot edited comment on SPARK-17557 at 3/24/17 4:37 PM: -- Hi, I got a similar issue with Spark 2.0.0, and it appears that I had a corrupted partitioned table: a column had different types in different partitions. It happened because I was writing partitions manually inside subfolders. So reading a specific partition worked, but reading the entire table gave this PlainValuesDictionary exception on df.show() or df.sort('column').show(). It was my fault, but Spark could check that the schema in each partition is the same. Cheers, Stéphane was (Author: scollot): Hi, I got a similar issue with Spark 2.0.0 and it appears that I had partitioned table, and I had some different columns types inside some partitions because I was writing those manually inside the subfolders of the partitions. So reading a specific partition was working, but reading the entire table was giving this PlainValuesDictionary exception on a df.show() or on df.sort('column').show() So it was my fault, but Spark could check if the schema in each partition is the same.
Cheers, Stéphane > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940672#comment-15940672 ] Stéphane Collot commented on SPARK-17557: - Hi, I got a similar issue with Spark 2.0.0, and it appears that I had a partitioned table with different column types inside some partitions, because I was writing those manually inside the partitions' subfolders. So reading a specific partition worked, but reading the entire table gave this PlainValuesDictionary exception on df.show() or df.sort('column').show(). It was my fault, but Spark could check that the schema in each partition is the same. Cheers, Stéphane > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at >
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
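The check Stéphane suggests — verify that every partition of the table carries the same schema before reading it — can be sketched in plain Python. Partition schemas are modeled here as simple name-to-type dicts standing in for real Parquet footers; `check_partition_schemas` is an invented name, not a Spark API:

```python
def check_partition_schemas(schemas):
    """Raise ValueError if any partition's schema differs from the first one.

    `schemas` maps a partition path to a {column_name: type_name} dict, a
    simplified stand-in for the schema stored in each Parquet footer."""
    items = list(schemas.items())
    ref_path, ref_schema = items[0]
    for path, schema in items[1:]:
        if schema != ref_schema:
            raise ValueError(
                f"schema mismatch: {path} has {schema}, "
                f"but {ref_path} has {ref_schema}")

# Mirrors the reported situation: one manually written partition ended up
# with a different column type, so reading the whole table fails.
schemas = {
    "year=2016/month=9/day=13": {"id": "int32", "name": "binary"},
    "year=2016/month=9/day=14": {"id": "binary", "name": "binary"},  # corrupted
}
try:
    check_partition_schemas(schemas)
    mismatch = ""
except ValueError as e:
    mismatch = str(e)
```

A check of this shape, run when the table's partitions are discovered, would surface the corrupted partition directly instead of the opaque `PlainValuesDictionary` cast error at scan time.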
[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940654#comment-15940654 ] Don Drake commented on SPARK-16745: --- I just came across the same exception running Spark 2.1.0 on my Mac, running a few spark-shell commands on a tiny dataset that previously worked just fine. But, I never got a result, just the timeout exceptions. The issue is that today I'm running them on a corporate VPN, with proxy setting enabled, and the IP address the driver is using is my local (Wifi) address that the proxy server cannot connect to. This took a while to figure out, but I added {{--conf spark.driver.host=127.0.0.1}} to my command-line and that forced all local networking between driver and executors (to bypass the proxy server) and the query came back in the expected amount of time. > Spark job completed however have to wait for 13 mins (data size is small) > - > > Key: SPARK-16745 > URL: https://issues.apache.org/jira/browse/SPARK-16745 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014 >Reporter: Joe Chong >Priority: Minor > > I submitted a job in scala spark shell to show a DataFrame. The data size is > about 43K. The job was successful in the end, but took more than 13 minutes > to resolve. Upon checking the log, there's multiple exception raised on > "Failed to check existence of class" with a java.net.connectionexpcetion > message indicating timeout trying to connect to the port 52067, the repl port > that Spark setup. Please assist to troubleshoot. Thanks. > Started Spark in standalone mode > $ spark-shell --driver-memory 5g --master local[*] > 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server > 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:30 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:52067 > 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class > server' on port 52067. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) > Type in expressions to have them evaluated. > Type :help for more information. > 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' > on port 52072. > 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/07/26 21:05:35 INFO Remoting: Starting remoting > 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] > 16/07/26 21:05:35 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 52074. 
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at > /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 > 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity > 3.4 GB > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:36 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at > http://10.199.29.218:4040 > 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host > localhost > 16/07/26 21:05:36 INFO
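The workaround from the comment above, written out as the spark-shell invocation from the reporter's log plus the extra setting (a configuration fragment; the proxy symptoms it fixes are specific to that reporter's VPN setup):

```shell
# Pin the driver's advertised address to loopback so driver <-> executor
# traffic stays local and bypasses the corporate proxy.
spark-shell --driver-memory 5g --master 'local[*]' \
  --conf spark.driver.host=127.0.0.1
```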
[jira] [Commented] (SPARK-7200) Tungsten test suites should fail if memory leak is detected
[ https://issues.apache.org/jira/browse/SPARK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940604#comment-15940604 ] Jose Soltren commented on SPARK-7200: - [~joshrosen] please have a look at my PR. Is this what you had in mind? Was there a different approach you were considering? Did you have any other tests in mind? I imagine the code has changed some since April of 2015. I didn't see too many methods exposed by @VisibleForTesting. > Tungsten test suites should fail if memory leak is detected > --- > > Key: SPARK-7200 > URL: https://issues.apache.org/jira/browse/SPARK-7200 > Project: Spark > Issue Type: New Feature > Components: SQL, Tests >Reporter: Reynold Xin > > We should be able to detect whether there are unreturned memory after each > suite and fail the suite. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
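The idea in the ticket — fail a suite whenever memory was granted but never returned — can be sketched with a toy allocator in plain Python. Tungsten's real accounting lives in the task memory manager; the class and method names below are invented for illustration:

```python
class TrackingAllocator:
    """Toy allocator that remembers outstanding allocations, so a test
    harness can fail the suite if anything was leaked at teardown."""

    def __init__(self):
        self.outstanding = {}  # handle -> size in bytes
        self.next_id = 0

    def allocate(self, nbytes):
        handle = self.next_id
        self.next_id += 1
        self.outstanding[handle] = nbytes
        return handle

    def free(self, handle):
        del self.outstanding[handle]

    def assert_no_leaks(self):
        # What the ticket asks each Tungsten suite to do after every test.
        if self.outstanding:
            leaked = sum(self.outstanding.values())
            raise AssertionError(f"memory leak detected: {leaked} bytes outstanding")

clean = TrackingAllocator()
h = clean.allocate(1024)
clean.free(h)
clean.assert_no_leaks()   # passes: everything was returned

leaky = TrackingAllocator()
leaky.allocate(4096)      # never freed -> assert_no_leaks() would raise
```

Hooked into a suite's afterEach, a check like `assert_no_leaks` turns a silent leak into an immediate test failure.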
[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table
[ https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940596#comment-15940596 ] Everett Anderson commented on SPARK-20073: -- Doesn't the join criteria convey that the user doesn't want a Cartesian product, though? Ideally, the fix would be to not do Cartesian product when 2 tables have a join criteria that isn't a constant of 'true'. > Unexpected Cartesian product when using eqNullSafe in join with a derived > table > --- > > Key: SPARK-20073 > URL: https://issues.apache.org/jira/browse/SPARK-20073 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.2, 2.1.0 >Reporter: Everett Anderson > Labels: correctness > > It appears that if you try to join tables A and B when B is derived from A > and you use the eqNullSafe / <=> operator for the join condition, Spark > performs a Cartesian product. > However, if you perform the join on tables of the same data when they don't > have a relationship, the expected non-Cartesian product join occurs. > {noformat} > // Create some fake data. > import org.apache.spark.sql.Row > import org.apache.spark.sql.Dataset > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions > val peopleRowsRDD = sc.parallelize(Seq( > Row("Fred", 8, 1), > Row("Fred", 8, 2), > Row(null, 10, 3), > Row(null, 10, 4), > Row("Amy", 12, 5), > Row("Amy", 12, 6))) > > val peopleSchema = StructType(Seq( > StructField("name", StringType, nullable = true), > StructField("group", IntegerType, nullable = true), > StructField("data", IntegerType, nullable = true))) > > val people = spark.createDataFrame(peopleRowsRDD, peopleSchema) > people.createOrReplaceTempView("people") > scala> people.show > ++-++ > |name|group|data| > ++-++ > |Fred|8| 1| > |Fred|8| 2| > |null| 10| 3| > |null| 10| 4| > | Amy| 12| 5| > | Amy| 12| 6| > ++-++ > // Now create a derived table from that table. It doesn't matter much what. 
> val variantCounts = spark.sql("select name, count(distinct(name, group, > data)) as variant_count from people group by name having variant_count > 1") > variantCounts.show > ++-+ > > |name|variant_count| > ++-+ > |Fred|2| > |null|2| > | Amy|2| > ++-+ > // Now try an inner join using the regular equalTo that drops nulls. This > works fine. > val innerJoinEqualTo = variantCounts.join(people, > variantCounts("name").equalTo(people("name"))) > innerJoinEqualTo.show > ++-++-++ > > |name|variant_count|name|group|data| > ++-++-++ > |Fred|2|Fred|8| 1| > |Fred|2|Fred|8| 2| > | Amy|2| Amy| 12| 5| > | Amy|2| Amy| 12| 6| > ++-++-++ > // Okay now lets switch to the <=> operator > // > // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error > like > // "Cartesian joins could be prohibitively expensive and are disabled by > default. To explicitly enable them, please set spark.sql.crossJoin.enabled = > true;" > // > // if you have enabled them, you'll get the table below. > // > // However, we really don't want or expect a Cartesian product! > val innerJoinSqlNullSafeEqOp = variantCounts.join(people, > variantCounts("name")<=>(people("name"))) > innerJoinSqlNullSafeEqOp.show > ++-++-++ > > |name|variant_count|name|group|data| > ++-++-++ > |Fred|2|Fred|8| 1| > |Fred|2|Fred|8| 2| > |Fred|2|null| 10| 3| > |Fred|2|null| 10| 4| > |Fred|2| Amy| 12| 5| > |Fred|2| Amy| 12| 6| > |null|2|Fred|8| 1| > |null|2|Fred|8| 2| > |null|2|null| 10| 3| > |null|2|null| 10| 4| > |null|2| Amy| 12| 5| > |null|2| Amy| 12| 6| > | Amy|2|Fred|8| 1| > | Amy|2|Fred|8| 2| > | Amy|2|null| 10| 3| > | Amy|2|null| 10| 4| > | Amy|2| Amy| 12| 5| > | Amy|2| Amy| 12| 6| > ++-++-++ > // Okay, let's try to construct the exact same variantCount table manually > // so it has no relationship to the original. > val variantCountRowsRDD = sc.parallelize(Seq( > Row("Fred", 2), > Row(null, 2), > Row("Amy", 2))) > > val
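The semantics at stake can be shown without Spark. SQL's `=` treats `NULL = NULL` as unknown, so null rows never join, while `<=>` (eqNullSafe) treats two NULLs as equal. A plain-Python model of the two operators (the function names are mine), applied to the `name` values from the report:

```python
def sql_equal_to(a, b):
    """SQL `=`: any comparison involving NULL (modeled as None) is unknown."""
    if a is None or b is None:
        return None            # unknown; a join condition filters it out
    return a == b

def eq_null_safe(a, b):
    """SQL `<=>`: NULL <=> NULL is TRUE, NULL <=> value is FALSE."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

names_left = ["Fred", None, "Amy"]                      # variantCounts.name
names_right = ["Fred", "Fred", None, None, "Amy", "Amy"]  # people.name

# Rows an inner join should keep under each operator:
kept_eq = [(l, r) for l in names_left for r in names_right
           if sql_equal_to(l, r) is True]
kept_nullsafe = [(l, r) for l in names_left for r in names_right
                 if eq_null_safe(l, r)]
```

Under these semantics the `<=>` join should yield 6 matched rows (the 4 from `equalTo` plus the 2 null-null pairs), not the 18-row Cartesian product the bug produces.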
[jira] [Assigned] (SPARK-15040) PySpark impl for ml.feature.Imputer
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-15040: -- Assignee: Nick Pentreath > PySpark impl for ml.feature.Imputer > --- > > Key: SPARK-15040 > URL: https://issues.apache.org/jira/browse/SPARK-15040 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.2.0 > > > PySpark impl for ml.feature.Imputer. > This needs to wait until the PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15040) PySpark impl for ml.feature.Imputer
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15040. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17316 [https://github.com/apache/spark/pull/17316] > PySpark impl for ml.feature.Imputer > --- > > Key: SPARK-15040 > URL: https://issues.apache.org/jira/browse/SPARK-15040 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > Fix For: 2.2.0 > > > PySpark impl for ml.feature.Imputer. > This needs to wait until the PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
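For reference, the behavior `ml.feature.Imputer` provides — replace missing values in a numeric column with that column's mean or median — sketched here in plain Python rather than the PySpark API being resolved above (the `impute` function name is mine):

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the non-missing values.

    A toy stand-in for ml.feature.Imputer's per-column behavior."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

col = [1.0, None, 3.0, None, 5.0]
filled = impute(col)  # mean of (1.0, 3.0, 5.0) is 3.0
```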
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-20080: --- Description: Steps to reproduce: 1) Use a Spark Streaming application. 2) Inside foreachRDD of any DStream, call rdd.foreachPartition. 3) Use an org.slf4j.Logger inside the foreachPartition action, either initialized there or captured as a field in the closure. Result: no exception is thrown, and the foreachPartition action is not executed. Expected result: a java.io.NotSerializableException is thrown. Description: When I use or initialize an org.slf4j.Logger inside foreachPartition, extracted into a trait method that is called from foreachRDD, the foreachPartition body silently does not execute and no exception appears. Tested with Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem. If I run the same code as a batch job,
I will get exception -> {code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923) at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12) at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7) at ReproduceBugMain$.main(ReproduceBugMain.scala:14) at ReproduceBugMain.main(ReproduceBugMain.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) Caused by: java.io.NotSerializableException: ReproduceBugMain$ Serialization stack: - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8) - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod) - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, ) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295) ... 18 more {code} On Github can be found 2 commit. 1 initial i add link on it(this one contain sptreaming example). and Last one with batch job example was: Step to reproduce: 1)Use SPak Streaming application. 2) inside foreachRDD of any DStream, use rdd,foreachPartition. 3) use org.slf4j.Logger. Init it or use as a filed with closure. inside foreachPartition action. Result: No exception throw. foreachPartition action not executed. Expected result: Throw java.io.NotSerializableException description: When i try use or init org.slf4j.Logger inside foreachPartition, that extracted to trait method. What was called in foreachRDD. I have found that foreachPartition method do not execute and no exception appeared. Tested on local and yarn mode spark. code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main class that explain problem. if i will run same code with batch job. I will get right exception. {code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at
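A Python analogue of the NotSerializableException in the stack trace above, with pickle standing in for Java serialization (the class names are mine): capturing a non-serializable object eagerly, here a thread lock, makes the whole object fail at serialization time, while the usual fix, much like creating the logger inside the partition function, is to construct the resource lazily on the worker.

```python
import pickle
import threading

class TaskWithLock:
    """Captures a lock eagerly -- pickling this fails, like the closure that
    captured the non-serializable ReproduceBugMain$ in the report above."""
    def __init__(self):
        self.lock = threading.Lock()   # _thread.lock objects are not picklable

class TaskLazyLock:
    """Creates the resource on first use instead of capturing it."""
    def __init__(self):
        self._lock = None
    @property
    def lock(self):
        if self._lock is None:
            self._lock = threading.Lock()  # built on the worker, never shipped
        return self._lock

def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False
```

The bug report's complaint is precisely that in the streaming path this failure is swallowed instead of surfacing, whereas the batch path raises it loudly.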
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-20080: --- Description: Steps to reproduce: 1) Use a Spark Streaming application. 2) Inside foreachRDD of any DStream, call rdd.foreachPartition. 3) Use an org.slf4j.Logger inside the foreachPartition action, either initialized there or captured as a field in the closure. Result: no exception is thrown, and the foreachPartition action is not executed. Expected result: a java.io.NotSerializableException is thrown. Description: When I use or initialize an org.slf4j.Logger inside foreachPartition, extracted into a trait method that is called from foreachRDD, the foreachPartition body silently does not execute and no exception appears. Tested with Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem. If I run the same code as a batch job, I get the expected exception.
{code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923) at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12) at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7) at ReproduceBugMain$.main(ReproduceBugMain.scala:14) at ReproduceBugMain.main(ReproduceBugMain.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) Caused by: java.io.NotSerializableException: ReproduceBugMain$ Serialization stack: - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8) - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod) - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, ) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295) ... 18 more {code} On Github can be found 2 commit. 1 initial i add link on it(this one contain sptreaming example). and Last one with batch job example was: Step to reproduce: 1)Use SPak Streaming application. 2) inside foreachRDD of any DStream, use rdd,foreachPartition. 3) use org.slf4j.Logger. Init it or use as a filed with closure. inside foreachPartition action. Result: No exception throw. foreachPartition action not executed. Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$ description: When i try use or init org.slf4j.Logger inside foreachPartition, that extracted to trait method. What was called in foreachRDD. I have found that foreachPartition method do not execute and no exception appeared. Tested on local and yarn mode spark. code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main class that explain problem. if i will run same code with batch job. I will get right exception. {code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-20080: --- Description: Steps to reproduce: 1) Use a Spark Streaming application. 2) Inside foreachRDD of any DStream, call rdd.foreachPartition. 3) Use an org.slf4j.Logger inside the foreachPartition action, either initialized there or captured as a field in the closure. Result: no exception is thrown, and the foreachPartition action is not executed. Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$. Description: When I use or initialize an org.slf4j.Logger inside foreachPartition, extracted into a trait method that is called from foreachRDD, the foreachPartition body silently does not execute and no exception appears. Tested with Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem. If I run the same code as a batch job, I get the expected exception.
{code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923) at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12) at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7) at ReproduceBugMain$.main(ReproduceBugMain.scala:14) at ReproduceBugMain.main(ReproduceBugMain.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) Caused by: java.io.NotSerializableException: ReproduceBugMain$ Serialization stack: - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8) - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod) - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, ) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295) ... 18 more {code} On Github can be found 2 commit. 1 initial i add link on it(this one contain sptreaming example). and Last one with batch job example was: Step to reproduce: 1)Use SPak Streaming application. 2) inside foreachRDD of any DStream, use rdd,foreachPartition. 3) use org.slf4j.Logger. Init it or use as a filed with closure. inside foreachPartition action. Result: No exception throw. Foraech action not executed. Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$ description: When i try use or init org.slf4j.Logger inside foreachPartition, that extracted to trait method. What was called in foreachRDD. I have found that foreachPartition method do not execute and no exception appeared. Tested on local and yarn mode spark. code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main class that explain problem. if i will run same code with batch job. I will get right exception. {code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Hryhoriev updated SPARK-20080:
-----------------------------------
    Description: 
Steps to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Inside the foreachPartition action, use org.slf4j.Logger: initialize it there, or use it as a field captured by the closure.

Result: no exception is thrown, and the foreach action is not executed.
Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$

When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD, I found that foreachPartition does not execute and no exception appears. Tested on Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that illustrate the problem. If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
    at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
    at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
    at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
    at ReproduceBugMain.main(ReproduceBugMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
    - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8)
    - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod)
    - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 18 more
{code}
On GitHub there are two commits: the initial one, linked above, contains the streaming example; the last one contains the batch job example.

  was:
Steps to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Inside the foreachPartition action, use org.slf4j.Logger: initialize it there, or use it as a field captured by the closure.

Result: no exception is thrown, and the foreach action is not executed.
Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$

When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD, I found that foreachPartition does not execute and no exception appears. Tested on Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that illustrate the problem. If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
{code}
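[Editor's note] The serialization stack above points at the implicit {{$outer}} field: a closure defined in a trait method captures the enclosing instance, so Spark has to serialize the whole (non-serializable) object. The mechanism can be sketched without Spark, with plain JVM serialization. The names below mirror the report, but the code is an illustrative assumption, not the reporter's actual source:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A function literal built inside a trait method that touches `this`
// keeps an implicit $outer reference to the enclosing instance.
trait TraitWithMethod {
  def makeTask(): Int => Int =
    (x: Int) => { this.hashCode(); x + 1 } // captures $outer
}

// The enclosing object is not Serializable -- just like ReproduceBugMain$.
object ReproduceBugMain extends TraitWithMethod

// Returns false when Java serialization of `obj` fails.
def serializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }
```

Serializing {{ReproduceBugMain.makeTask()}} fails with {{NotSerializableException: ReproduceBugMain$}}, which is exactly what the batch job reports; the streaming path apparently swallows the error.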
[jira] [Created] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
Mathieu D created SPARK-20082:
---------------------------------

             Summary: Incremental update of LDA model, by adding initialModel as start point
                 Key: SPARK-20082
                 URL: https://issues.apache.org/jira/browse/SPARK-20082
             Project: Spark
          Issue Type: Wish
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: Mathieu D

Some MLlib models support an initialModel to start from and update incrementally with new data.
From what I understand of OnlineLDAOptimizer, it is possible to incrementally update an existing model with batches of new documents.
I suggest adding an initialModel as a starting point for LDA.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8586) SQL add jar command does not work well with Scala REPL
[ https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8586. -- Resolution: Duplicate I think this and several JIRAs are just the same issue: add jar vs --jars and what classloader ends up with the classes. > SQL add jar command does not work well with Scala REPL > -- > > Key: SPARK-8586 > URL: https://issues.apache.org/jira/browse/SPARK-8586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. > So, SerDe added through add jar command may not be loaded in the context > class loader when we lookup the table. > For example, the following code will fail when we try to show the table. > {code} > hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar") > hive.sql("drop table if exists jsonTable") > hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE > 'org.apache.hive.hcatalog.data.JsonSerDe'") > hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", > "val").insertInto("jsonTable") > hive.table("jsonTable").show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
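[Editor's note] The practical workaround implied by the duplicate resolution is to put the SerDe jar on the classpath when the session is created ({{--jars}} on the command line, or {{spark.jars}} programmatically), rather than relying on the runtime {{add jar}} command. A hedged sketch, reusing the jar path from the report; this is an assumption about the workaround, not a verified fix for every classloader variant:

```scala
import org.apache.spark.sql.SparkSession

// Ship the SerDe jar at launch time so it is visible both to the REPL's
// context classloader and to the executors, sidestepping the classloader
// reset in loadAndRunReq.
val spark = SparkSession.builder()
  .appName("json-serde-example")
  .config("spark.jars", "sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS jsonTable(key INT, val STRING) " +
  "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'")
spark.table("jsonTable").show()
```

Equivalently, launch with {{spark-shell --jars <path-to-jar>}}.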
[jira] [Resolved] (SPARK-20019) spark can not load alluxio fileSystem after adding jar
[ https://issues.apache.org/jira/browse/SPARK-20019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20019.
-------------------------------
    Resolution: Duplicate

> spark can not load alluxio fileSystem after adding jar
> ------------------------------------------------------
>
>                 Key: SPARK-20019
>                 URL: https://issues.apache.org/jira/browse/SPARK-20019
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: roncenzhao
>         Attachments: exception_stack.png
>
> The following SQL cannot load the Alluxio file system; it throws `ClassNotFoundException`:
> ```
> add jar /xxx/xxx/alluxioxxx.jar;
> set fs.alluxio.impl=alluxio.hadoop.FileSystem;
> select * from alluxionTbl;
> ```

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8586) SQL add jar command does not work well with Scala REPL
[ https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941589#comment-15941589 ]

Takeshi Yamamuro commented on SPARK-8586:
-----------------------------------------
This issue still happens in v2.1.0 and the master.
{code}
scala> sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
scala> sql("drop table if exists jsonTable")
scala> sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'")
scala> spark.range(100).selectExpr("CAST(id AS INT) AS key", "'xxx' AS val").write.insertInto("jsonTable")
scala> sql("SELECT * FROM jsonTable").show
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe
  at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:114)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.<init>(HiveTableScanExec.scala:94)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$11.apply(HiveStrategies.scala:207)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$11.apply(HiveStrategies.scala:207)
  at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:92)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:203)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
{code}

> SQL add jar command does not work well with Scala REPL
> ------------------------------------------------------
>
>                 Key: SPARK-8586
>                 URL: https://issues.apache.org/jira/browse/SPARK-8586
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0, 2.1.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>
> Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}.
> So, a SerDe added through the add jar command may not be loaded in the context class loader when we look up the table.
> For example, the following code will fail when we try to show the table.
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8586) SQL add jar command does not work well with Scala REPL
[ https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-8586: Affects Version/s: 2.1.0 > SQL add jar command does not work well with Scala REPL > -- > > Key: SPARK-8586 > URL: https://issues.apache.org/jira/browse/SPARK-8586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. > So, SerDe added through add jar command may not be loaded in the context > class loader when we lookup the table. > For example, the following code will fail when we try to show the table. > {code} > hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar") > hive.sql("drop table if exists jsonTable") > hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE > 'org.apache.hive.hcatalog.data.JsonSerDe'") > hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", > "val").insertInto("jsonTable") > hive.table("jsonTable").show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
Christian Reiniger created SPARK-20081:
------------------------------------------

             Summary: RandomForestClassifier doesn't seem to support more than 100 labels
                 Key: SPARK-20081
                 URL: https://issues.apache.org/jira/browse/SPARK-20081
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.0
         Environment: Java
            Reporter: Christian Reiniger

When feeding data with more than 100 labels into RandomForestClassifier#fit() (from Java code), I get the following error message:

{code}
Classifier inferred 143 from label values in column rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values. To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.
{code}

Setting "numClasses" in the metadata for the label column doesn't make a difference. Looking at the code, this is not surprising, since MetadataUtils.getNumClasses() ignores this setting:

{code:language=scala}
def getNumClasses(labelSchema: StructField): Option[Int] = {
  Attribute.fromStructField(labelSchema) match {
    case binAttr: BinaryAttribute => Some(2)
    case nomAttr: NominalAttribute => nomAttr.getNumValues
    case _: NumericAttribute | UnresolvedAttribute => None
  }
}
{code}

The alternative would be to pass a proper "maxNumClasses" parameter to the classifier, so that Classifier#getNumClasses() allows a larger number of auto-detected labels. However, RandomForestClassifier#train() calls #getNumClasses without the "maxNumClasses" parameter, causing it to use the default of 100:

{code:language=scala}
override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
  val categoricalFeatures: Map[Int, Int] =
    MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
  val numClasses: Int = getNumClasses(dataset)
  // ...
{code}

My Scala skills are pretty sketchy, so please correct me if I misinterpreted something.
But as it seems right now, there is no way to learn from data with more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
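[Editor's note] The error message's hint can be followed without StringIndexer by attaching {{NominalAttribute}} metadata to the label column: {{MetadataUtils.getNumClasses}} resolves the count through the {{nomAttr.getNumValues}} case quoted above. A hedged Scala sketch; the DataFrame {{df}} and the column names are assumptions, and whether this resolves the reporter's exact case is untested:

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.RandomForestClassifier

// Declare 143 classes on the label column via attribute metadata -- the same
// mechanism StringIndexer uses -- so getNumClasses() does not have to infer
// them from the data and hit the 100-class cap.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(143)
  .toMetadata()

val labeled = df.withColumn("label", df("label").as("label", labelMeta))

val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(labeled)
```

If this works, it would confirm that only the inference path is capped, not the classifier itself.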
[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940130#comment-15940130 ]

Mohamed Baddar commented on SPARK-1548:
---------------------------------------
[~manishamde] [~sowen] [~josephkb] I have some experience contributing to starter tasks in Spark, and I found this issue interesting. Investigating the partial implementation of RF, I found these resources:
https://mahout.apache.org/users/classification/partial-implementation.html
https://github.com/apache/mahout/blob/b5fe4aab22e7867ae057a6cdb1610cfa17555311/mr/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/package-info.java
I think analyzing the Mahout implementation provides a good basis for analyzing a partial RF implementation, both in theory and in practice. If this issue is still important to Spark, it would be great if I could start on it, beginning with an analysis document on the current Mahout implementation to assess its performance.

> Add Partial Random Forest algorithm to MLlib
> --------------------------------------------
>
>                 Key: SPARK-1548
>                 URL: https://issues.apache.org/jira/browse/SPARK-1548
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>
> This task involves creating an alternate approximate random forest implementation where each tree is constructed per partition.
> The task involves:
> - Justifying with theory and experimental results why this algorithm is a good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Hryhoriev updated SPARK-20080:
-----------------------------------
    Description: 
Steps to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Inside the foreachPartition action, use org.slf4j.Logger: initialize it there, or use it as a field captured by the closure.

Result: no exception is thrown, and the foreach action is not executed.
Expected result: Caused by: java.io.NotSerializableException: ReproduceBugMain$

When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD, I found that foreachPartition does not execute and no exception appears. Tested on Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that illustrate the problem. If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
    at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
    at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
    at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
    at ReproduceBugMain.main(ReproduceBugMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
    - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8)
    - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod)
    - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 18 more
{code}
On GitHub there are two commits: the initial one, linked above, contains the streaming example; the last one contains the batch job example.

  was:
When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD, I found that foreachPartition does not execute and no exception appears. Tested on Spark in local and YARN mode. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that illustrate the problem. If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
{code}
[jira] [Commented] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940113#comment-15940113 ]

Nick Hryhoriev commented on SPARK-20080:
----------------------------------------
It's a clear issue, even with a link to the code. You have code to reproduce it, which is more than enough to understand the problem, even if my English is not good enough.

> Spark streaming application does not throw serialisation exception in foreachRDD
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-20080
>                 URL: https://issues.apache.org/jira/browse/SPARK-20080
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.1.0
>         Environment: local Spark, and YARN from Bigtop 1.1.0
>            Reporter: Nick Hryhoriev
>            Priority: Minor
>
> When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD,
> I found that foreachPartition does not execute and no exception appears.
> Tested on Spark in local and YARN mode.
> The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
> There are two main classes that illustrate the problem.
> If I run the same code as a batch job, I get the expected exception:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not serializable
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>     at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
>     at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
>     at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
>     at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
>     at ReproduceBugMain.main(ReproduceBugMain.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
>     - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8)
>     - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod)
>     - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, )
>     at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>     ... 18 more
> {code}
> On GitHub there are two commits: the initial one, linked above, contains the streaming example; the last one contains the batch job example.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20080.
-------------------------------
    Resolution: Invalid

> Spark streaming application does not throw serialisation exception in foreachRDD
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-20080
>                 URL: https://issues.apache.org/jira/browse/SPARK-20080
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.1.0
>         Environment: local Spark, and YARN from Bigtop 1.1.0
>            Reporter: Nick Hryhoriev
>            Priority: Minor
>
> When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD,
> I found that foreachPartition does not execute and no exception appears.
> Tested on Spark in local and YARN mode.
> The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
> There are two main classes that illustrate the problem.
> If I run the same code as a batch job, I get the expected exception:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not serializable
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
>     at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>     at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
>     at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
>     at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
>     at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
>     at ReproduceBugMain.main(ReproduceBugMain.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
>     - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8)
>     - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod)
>     - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, )
>     at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>     ... 18 more
> {code}
> On GitHub there are two commits: the initial one, linked above, contains the streaming example; the last one contains the batch job example.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20080) Spark streaming application does not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reopened SPARK-20080:
-------------------------------

> Spark streaming application does not throw serialisation exception in foreachRDD
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-20080
>                 URL: https://issues.apache.org/jira/browse/SPARK-20080
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.1.0
>         Environment: local Spark, and YARN from Bigtop 1.1.0
>            Reporter: Nick Hryhoriev
>            Priority: Minor
>
> When I use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called in foreachRDD,
> I found that foreachPartition does not execute and no exception appears.
> Tested on Spark in local and YARN mode.
> The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
> There are two main classes that illustrate the problem.
> If I run the same code as a batch job, I get the expected exception:
> {code:java} > Exception in thread "main" org.apache.spark.SparkException: Task not > serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923) > at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12) > at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7) > at ReproduceBugMain$.main(ReproduceBugMain.scala:14) > at ReproduceBugMain.main(ReproduceBugMain.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) > Caused by: java.io.NotSerializableException: ReproduceBugMain$ > Serialization stack: > - object not serializable (class: ReproduceBugMain$, value: > ReproduceBugMain$@3935e9a8) > - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: > $outer, type: interface TraitWithMethod) > - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, > ) > at > 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295) > ... 18 more > {code} > On Github can be found 2 commit. 1 initial i add link on it(this one contain > sptreaming example). and Last one with batch job example -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
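The decisive part of the trace above is {{Caused by: java.io.NotSerializableException: ReproduceBugMain$}} together with the {{$outer}} field: a function defined in a trait (or any instance) method implicitly captures its enclosing instance, so the whole closure fails Java serialization whenever that instance is not serializable. The sketch below is a plain-Java analogue of the same mechanism, with no Spark dependency; all class names (OuterCaptureDemo, NonSerializableOuter) are hypothetical, chosen only to mirror the roles in the trace.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class OuterCaptureDemo {

    // A Runnable that is itself Serializable, like a Spark task closure.
    interface SerializableRunnable extends Runnable, Serializable {}

    // Plays the role of ReproduceBugMain$ / the trait instance: it defines the
    // closure in an instance method but is not itself Serializable.
    static class NonSerializableOuter {
        SerializableRunnable makeClosure() {
            // The anonymous class captures NonSerializableOuter.this — the
            // synthetic "$outer" field named in the stack trace above.
            return new SerializableRunnable() {
                @Override public void run() {
                    System.out.println(NonSerializableOuter.this);
                }
            };
        }
    }

    // Attempt Java serialization, which is effectively what Spark's
    // ClosureCleaner.ensureSerializable does before shipping a task.
    static boolean serializes(Object closure) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
            return true;
        } catch (NotSerializableException e) {
            return false;                 // the captured outer instance is the culprit
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints "false": serializing the closure drags in the outer instance
        System.out.println(serializes(new NonSerializableOuter().makeClosure()));
    }
}
```

The usual workaround in Spark code is to define the function in a top-level object, or assign it to a local val inside the method, so the generated closure carries no reference to the enclosing instance.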
[jira] [Resolved] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20080.
---
Resolution: Fixed
[~hryhoriev.nick] I can't understand what you're reporting here, and so it's not nearly suitable as a JIRA. Please read http://spark.apache.org/contributing.html This is not a place for technical support, but for describing clear issues or improvements along with a specific change if possible. If you significantly improve the description here, I will reopen it. Do not reopen this issue on your own.
> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
> Reporter: Nick Hryhoriev
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code
[ https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940078#comment-15940078 ] Takeshi Yamamuro commented on SPARK-13644:
---
Could we close this? I think this issue has already resolved?
> Add the source file name and line into Logger when an exception occurs in the generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kazuaki Ishizaki
> Priority: Minor
>
> This is to show a message that points out the origin of a generated method when an exception occurs in that generated method at runtime.
> An example of such a message (the first line is newly added):
> {code}
> 07:49:29.525 ERROR org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: The method GeneratedIterator.processNext() is generated for filter at Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 2.0 (TID 4)
> java.lang.NullPointerException:
>     at ...
> {code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code
[ https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940078#comment-15940078 ] Takeshi Yamamuro edited comment on SPARK-13644 at 3/24/17 9:51 AM:
---
Could we close this? I think this issue has already been resolved?
was (Author: maropu): Could we close this? I think this issue has already resolved?
> Add the source file name and line into Logger when an exception occurs in the generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kazuaki Ishizaki
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940076#comment-15940076 ] Takeshi Yamamuro commented on SPARK-10719:
---
I added a link to SPARK-19810, and we could close this once SPARK-19810 is resolved.
> SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
> --
>
> Key: SPARK-10719
> URL: https://issues.apache.org/jira/browse/SPARK-10719
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.0
> Environment: Scala 2.10
> Reporter: Shixiong Zhu
>
> Sometimes the following code fails:
> {code}
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> (1 to 1000).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}
> The stack trace is:
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: tail of empty list
>     at scala.collection.immutable.Nil$.tail(List.scala:339)
>     at scala.collection.immutable.Nil$.tail(List.scala:334)
>     at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
>     at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
>     at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
>     at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
>     at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
>     at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
>     at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>     at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>     at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
>     at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>     at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>     at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
>     at scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
>     at scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
>     at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>     at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>     at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
>     at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>     at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>     at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>     at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>     at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Finally, I found the problem: the code generated by the Scala compiler to find the implicit TypeTag is not thread safe, because of an issue in Scala 2.10: https://issues.scala-lang.org/browse/SI-6240
> This issue was fixed in Scala 2.11 but not backported to 2.10.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
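The failure mode described above — many threads racing through the first, lazily-initialized use of Scala 2.10's runtime reflection — suggests the standard mitigation of forcing the one-time initialization on a single thread before fanning out (for example, materializing the implicit conversion once before the {{.par}} loop). The sketch below is a generic Java analogue of that hoist-the-initialization pattern, not Spark's actual fix; FragileLazy and the other names are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class HoistInitDemo {

    // Stand-in for a lazily-initialized, non-thread-safe resource, such as
    // the runtime-reflection mirror behind the implicit TypeTag lookup.
    static class FragileLazy {
        private List<String> state;              // deliberately unsynchronized

        List<String> get() {
            if (state == null) {                 // unsafe check-then-act if raced
                List<String> s = new ArrayList<>();
                s.add("initialized");
                state = s;
            }
            return state;
        }
    }

    // The workaround pattern: force the one-time initialization on a single
    // thread *before* fanning out, so parallel tasks only read shared state.
    static long countWithHoistedInit(FragileLazy lazy, int tasks) {
        lazy.get();                              // hoisted initialization
        return IntStream.range(0, tasks).parallel()
                .mapToObj(i -> lazy.get().get(0))   // read-only from here on
                .filter("initialized"::equals)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countWithHoistedInit(new FragileLazy(), 1000));
    }
}
```

With the initialization hoisted, the parallel tasks only read already-constructed state, which is safe even though FragileLazy.get() itself is not thread safe on first call — the same reason calling toDF once before the parallel loop sidesteps the Scala 2.10 race.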
[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-20080:
---
Summary: Spak streaming application do not throw serialisation exception in foreachRDD (was: Spak streaming up do not throw serialisation exception in foreachRDD)
> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
> Reporter: Nick Hryhoriev
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org