[jira] [Closed] (SPARK-6424) Support user-defined aggregators in AggregateFunction

2017-03-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-6424.
---
Resolution: Duplicate

> Support user-defined aggregators in AggregateFunction
> -
>
> Key: SPARK-6424
> URL: https://issues.apache.org/jira/browse/SPARK-6424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Add a new interface to implement user-defined aggregators in 
> AggregateExpression.
> This enables third parties to easily implement various operations on 
> DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6424) Support user-defined aggregators in AggregateFunction

2017-03-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941589#comment-15941589
 ] 

Takeshi Yamamuro commented on SPARK-6424:
-

The aggregation improvements were already made in SPARK-4366, so I'll close 
this.

> Support user-defined aggregators in AggregateFunction
> -
>
> Key: SPARK-6424
> URL: https://issues.apache.org/jira/browse/SPARK-6424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Add a new interface to implement user-defined aggregators in 
> AggregateExpression.
> This enables third parties to easily implement various operations on 
> DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2017-03-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941577#comment-15941577
 ] 

Hyukjin Kwon commented on SPARK-20090:
--

Hm.. don't we have {{names}}?

Scala

{code}
scala> spark.range(1).schema.fieldNames
res1: Array[String] = Array(id)
{code}

Python

{code}
>>> spark.range(1).schema.names
['id']
{code}
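
A quick, runnable sketch of the existing Python equivalents (the proposed {{fieldNames}} would presumably return the same list, matching Scala's {{fieldNames}}):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
schema = spark.range(1).schema

# Existing ways to get the field names from a Python StructType today:
print(schema.names)                     # ['id']
print([f.name for f in schema.fields])  # same list, via the StructField objects
{code}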

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17137:


Assignee: Apache Spark  (was: Seth Hendrickson)

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization is 
> applied, it may be more efficient to store the coefficients in a compressed 
> format. We can add this option to MLOR and perhaps do some performance tests 
> to verify the improvements.
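
To make the idea concrete, below is a hedged PySpark sketch of the kind of dense-to-sparse conversion being proposed; the real change would live inside the Scala MLOR model, and {{compress_if_sparse}} is a made-up helper, not an existing API:

{code}
from pyspark.ml.linalg import DenseVector, Vectors

def compress_if_sparse(coefficients, density_threshold=0.5):
    # Made-up helper: switch to a SparseVector when fewer than
    # `density_threshold` of the entries are non-zero.
    values = coefficients.toArray()
    nnz = sum(1 for v in values if v != 0.0)
    if nnz < density_threshold * len(values):
        return Vectors.sparse(len(values),
                              [(i, float(v)) for i, v in enumerate(values) if v != 0.0])
    return coefficients

# e.g. coefficients driven to zero by strong L1 regularization
dense = DenseVector([0.0, 0.0, 1.3, 0.0, 0.0, -0.7, 0.0, 0.0])
print(compress_if_sparse(dense))   # (8,[2,5],[1.3,-0.7])
{code}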



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941560#comment-15941560
 ] 

Apache Spark commented on SPARK-17137:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/17426

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization is 
> applied, it may be more efficient to store the coefficients in a compressed 
> format. We can add this option to MLOR and perhaps do some performance tests 
> to verify the improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17137:


Assignee: Seth Hendrickson  (was: Apache Spark)

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization is 
> applied, it may be more efficient to store the coefficients in a compressed 
> format. We can add this option to MLOR and perhaps do some performance tests 
> to verify the improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20091) DagScheduler should allow running concurrent attempts of a stage in case of multiple fetch failure

2017-03-24 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-20091:
---

 Summary: DagScheduler should allow running concurrent attempts of 
a stage in case of multiple fetch failure
 Key: SPARK-20091
 URL: https://issues.apache.org/jira/browse/SPARK-20091
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.0.1
Reporter: Sital Kedia


Currently, the DAG scheduler does not allow running concurrent attempts of a 
stage in case of multiple fetch failures. As a result, when multiple fetch 
failures are detected, serial execution of the map stage delays the job run 
significantly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20090) Add StructType.fieldNames to Python API

2017-03-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20090:
-

 Summary: Add StructType.fieldNames to Python API
 Key: SPARK-20090
 URL: https://issues.apache.org/jira/browse/SPARK-20090
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 2.1.0
Reporter: Joseph K. Bradley
Priority: Trivial


The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would be 
nice if the Python {{StructType}} did as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941410#comment-15941410
 ] 

Apache Spark commented on SPARK-20089:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17424

> Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
> -
>
> Key: SPARK-20089
> URL: https://issues.apache.org/jira/browse/SPARK-20089
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to 
> SQLQueryTestSuite for us to review the outputs of each function.
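
For reference, a minimal PySpark sketch of the statements such tests would exercise, run here through spark.sql purely for illustration (not through the SQLQueryTestSuite harness itself):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# DESC is an alias for DESCRIBE; `abs` is just an arbitrary built-in function.
spark.sql("DESC FUNCTION abs").show(truncate=False)
spark.sql("DESC FUNCTION EXTENDED abs").show(truncate=False)
{code}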



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20089:


Assignee: Apache Spark  (was: Xiao Li)

> Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
> -
>
> Key: SPARK-20089
> URL: https://issues.apache.org/jira/browse/SPARK-20089
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to 
> SQLQueryTestSuite for us to review the outputs of each function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20089:


Assignee: Xiao Li  (was: Apache Spark)

> Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
> -
>
> Key: SPARK-20089
> URL: https://issues.apache.org/jira/browse/SPARK-20089
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to 
> SQLQueryTestSuite for us to review the outputs of each function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20089) Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite

2017-03-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20089:

Summary: Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite  
(was: Added DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite)

> Add DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite
> -
>
> Key: SPARK-20089
> URL: https://issues.apache.org/jira/browse/SPARK-20089
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to 
> SQLQueryTestSuite for us to review the outputs of each function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20089) Added DESC FUNCTION and DESC EXTENDED FUNCTION to SQLQueryTestSuite

2017-03-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20089:
---

 Summary: Added DESC FUNCTION and DESC EXTENDED FUNCTION to 
SQLQueryTestSuite
 Key: SPARK-20089
 URL: https://issues.apache.org/jira/browse/SPARK-20089
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


{{DESC FUNCTION}} and {{DESC EXTENDED FUNCTION}} should be added to 
SQLQueryTestSuite for us to review the outputs of each function.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20088:


Assignee: (was: Apache Spark)

> Do not create new SparkContext in SparkR createSparkContext
> ---
>
> Key: SPARK-20088
> URL: https://issues.apache.org/jira/browse/SPARK-20088
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> In the implementation of {{createSparkContext}}, we are calling 
> {code}
>  new JavaSparkContext()
> {code}
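
As an illustration of the intent only (reuse an existing context instead of constructing a new one), here is a minimal PySpark analogy; the actual issue concerns the SparkR backend's createSparkContext, so this is not the proposed fix itself:

{code}
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("reuse-context-demo")
sc = SparkContext.getOrCreate(conf)   # returns the active context if one already exists
same = SparkContext.getOrCreate()     # no second context is constructed here
assert sc is same
{code}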



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20088:


Assignee: Apache Spark

> Do not create new SparkContext in SparkR createSparkContext
> ---
>
> Key: SPARK-20088
> URL: https://issues.apache.org/jira/browse/SPARK-20088
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> In the implementation of {{createSparkContext}}, we are calling 
> {code}
>  new JavaSparkContext()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941394#comment-15941394
 ] 

Apache Spark commented on SPARK-20088:
--

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/17423

> Do not create new SparkContext in SparkR createSparkContext
> ---
>
> Key: SPARK-20088
> URL: https://issues.apache.org/jira/browse/SPARK-20088
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> In the implementation of {{createSparkContext}}, we are calling 
> {code}
>  new JavaSparkContext()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-24 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20088:
--

 Summary: Do not create new SparkContext in SparkR 
createSparkContext
 Key: SPARK-20088
 URL: https://issues.apache.org/jira/browse/SPARK-20088
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


In the implementation of {{createSparkContext}}, we are calling 
{code}
 new JavaSparkContext()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-24 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-19408:
---
Target Version/s: 2.2.0  (was: 2.3.0)

> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>
> In SPARK-17075, we estimate the cardinality of a predicate expression "column 
> (op) literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns, such as "column-1 (op) column-2", 
> where column-1 and column-2 belong to the same table.  Note that if column-1 
> and column-2 belong to different tables, then it is a join operator's work, NOT 
> a filter operator's work.
> In this JIRA, we want to estimate the filter factor of predicate expressions 
> involving two columns of the same table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2017-03-24 Thread Charles Lewis (JIRA)
Charles Lewis created SPARK-20087:
-

 Summary: Include accumulators / taskMetrics when sending 
TaskKilled to onTaskEnd listeners
 Key: SPARK-20087
 URL: https://issues.apache.org/jira/browse/SPARK-20087
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Charles Lewis


When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
accumulators / task metrics for that task, if they were still available. These 
metrics are not currently sent when tasks are killed intentionally, such as 
when a speculative retry finishes, and the original is killed (or vice versa). 
Since we're killing these tasks ourselves, these metrics should almost always 
exist, and we should treat them the same way as we treat ExceptionFailures.

Sending these metrics with the TaskKilled end reason makes aggregation across 
all tasks in an app more accurate. This data can inform decisions about how to 
tune the speculation parameters in order to minimize duplicated work, and in 
general, the total cost of an app should include both successful and failed 
tasks, if that information exists.

PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20040:


Assignee: Apache Spark

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]
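
A hedged sketch of how the Python wrapper might be used once it exists; the class and method names below are assumed to mirror the Scala org.apache.spark.ml.stat.ChiSquareTest API and are not yet available at the time of this issue:

{code}
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest  # assumed module/class name
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

result = ChiSquareTest.test(df, "features", "label").head()
print(result.pValues)   # one p-value per feature column
{code}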



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941367#comment-15941367
 ] 

Apache Spark commented on SPARK-20040:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/17421

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20040:


Assignee: (was: Apache Spark)

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20070) Redact datasource explain output

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941345#comment-15941345
 ] 

Apache Spark commented on SPARK-20070:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17420

> Redact datasource explain output
> 
>
> Key: SPARK-20070
> URL: https://issues.apache.org/jira/browse/SPARK-20070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> When calling explain on a datasource, the output can contain sensitive 
> information. We should provide a way for an admin/user to redact such information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941322#comment-15941322
 ] 

Apache Spark commented on SPARK-19634:
--

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/17419

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
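
For context, a minimal PySpark sketch of the RDD-based functionality being ported; Statistics.colStats in spark.mllib is backed by MultivariateOnlineSummarizer:

{code}
import numpy as np
from pyspark.mllib.stat import Statistics
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([
    np.array([1.0, 10.0, 100.0]),
    np.array([2.0, 20.0, 200.0]),
    np.array([3.0, 30.0, 300.0]),
])

summary = Statistics.colStats(rdd)
print(summary.mean())         # per-column means
print(summary.variance())     # per-column variances
print(summary.numNonzeros())  # per-column non-zero counts
{code}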



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19634:


Assignee: Apache Spark  (was: Timothy Hunter)

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19634:


Assignee: Timothy Hunter  (was: Apache Spark)

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19846) Add a flag to disable constraint propagation

2017-03-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-19846.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.2.0

> Add a flag to disable constraint propagation
> 
>
> Key: SPARK-19846
> URL: https://issues.apache.org/jira/browse/SPARK-19846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> Constraint propagation can be computationally expensive and can block driver 
> execution for a long time. For example, the benchmark below takes about 30 minutes.
> Compared with other attempts to modify how constraint propagation works, 
> this is a much simpler option: add a flag to disable constraint propagation.
> {code}
> // assumes a spark-shell session, so `spark` and the SQL implicits are in scope
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
> import org.apache.spark.sql.internal.SQLConf
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
>   (df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}
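
For a rough idea of the same switch from PySpark, assuming the SQL conf key defined by SQLConf.CONSTRAINT_PROPAGATION_ENABLED is spark.sql.constraintPropagation.enabled (hedged; the Scala constant is authoritative):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Assumed conf key; check SQLConf.CONSTRAINT_PROPAGATION_ENABLED for the real name.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}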



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20070) Redact datasource explain output

2017-03-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20070.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Redact datasource explain output
> 
>
> Key: SPARK-20070
> URL: https://issues.apache.org/jira/browse/SPARK-20070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> When calling explain on a datasource, the output can contain sensitive 
> information. We should provide a way for an admin/user to redact such information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H edited comment on SPARK-12216 at 3/24/17 10:25 PM:
---

I was able to reproduce this bug on Windows with the latest Spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextFile in the script.

Example scenarios:

* ERROR when I include --jars AND use saveAsTextFile
* Works when I use saveAsTextFile but don't pass any --jars on the command line
* Works when I include --jars on the command line but don't use saveAsTextFile 
(commented out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar


bugtest.txt:
{noformat}
one
two
three
{noformat}

sparkbugtest.py:

{noformat}
import sys
import time

from pyspark.sql import SparkSession


def main():
    # Initialize the spark context.
    spark = SparkSession \
        .builder \
        .appName("SparkParseLogTest") \
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    time.sleep(10)  # just for debug to see how files change in the temporary folder

    # at this point there is only this script (sparkbugtest.py) in the temporary folder

    lines.saveAsTextFile(sys.argv[2])

    # at this point there are both sparkbugtest.py and aws-java-sdk-1.7.4.jar in the temporary folder

    time.sleep(10)  # for debug


if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:

* At first there is the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the 
{{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} 
gets deleted
* After the spark-submit has ended, the {{aws-java-sdk-1.7.4.jar}} is still in 
the temporary folder, which couldn't be deleted 

The temp folder in this example was like: 
C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\




was (Author: jouni):
I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:

* At first there is the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the 
{{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} 
get's deleted
* After the spark-submit has ended the {{aws-java-sdk-1.7.4.jar}} is still in 
the temporary folder that couldn't be deleted 

The temp folder in this example was like: 
C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\



> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH 

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example 
code:

{code:title=borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:


|  x|AmtPaid|
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|


next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,


|  x|AmtPaid|AmtPaidCumSum|
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|

next, compute cumulative max,

{code}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log

{noformat}
 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:


Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
{noformat}

but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}


|  x|AmtPaid|AmtPaidCumSum|MaxBound|
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']
    return output

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll,
                                sparktypes.FloatType())
df = df.withColumn('AmtPaidAdjusted',
                   udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns])))
df.show()
{code}

gives,

{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc 

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H edited comment on SPARK-12216 at 3/24/17 9:58 PM:
--

I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession


def main():
    # Initialize the spark context.
    spark = SparkSession \
        .builder \
        .appName("SparkParseLogTest") \
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])


if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:

* At first there is the {{sparkbugtest.py}}
* At the end (I think during saveAsTextFile, or after it), the 
{{aws-java-sdk-1.7.4.jar}} file is copied there and the {{sparkbugtest.py}} 
get's deleted
* After the spark-submit has ended the {{aws-java-sdk-1.7.4.jar}} is still in 
the temporary folder that couldn't be deleted 

The temp folder in this example was like: 
C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\




was (Author: jouni):
I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:

* At first there is the {{sparkbugtest.py}}
* At the end, the {{aws-java-sdk-1.7.4.jar}} file is copied there and the 
{{sparkbugtest.py}} get's deleted
* After the spark-submit has ended the {{aws-java-sdk-1.7.4.jar}} is still in 
the temporary folder that couldn't be deleted 

The temp folder in this example was like: 
C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\



> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: 

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H edited comment on SPARK-12216 at 3/24/17 9:54 PM:
--

I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession


def main():
    # Initialize the spark context.
    spark = SparkSession \
        .builder \
        .appName("SparkParseLogTest") \
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])


if __name__ == "__main__":
    main()
{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

What happens in the userFiles tmp folder is interesting:

* At first there is the {{sparkbugtest.py}}
* At the end, the {{aws-java-sdk-1.7.4.jar}} file is copied there and the 
{{sparkbugtest.py}} get's deleted
* After the spark-submit has ended the {{aws-java-sdk-1.7.4.jar}} is still in 
the temporary folder that couldn't be deleted 

The temp folder in this example was like: 
C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\




was (Author: jouni):
I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

After the error is thrown and and spark-submit has ended, I take a look at the 
folder that couldn't be deleted, it has the .jar file inside, for example 
{{C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\aws-java-sdk-1.7.4.jar}}


> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> 

[jira] [Commented] (SPARK-19674) Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941213#comment-15941213
 ] 

Apache Spark commented on SPARK-19674:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/17418

> Ignore driver accumulator updates don't belong to the execution when merging 
> all accumulator updates
> 
>
> Key: SPARK-19674
> URL: https://issues.apache.org/jira/browse/SPARK-19674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> In SQLListener.getExecutionMetrics, driver accumulator updates don't belong 
> to the execution should be ignored when merging all accumulator updates to 
> prevent NoSuchElementException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example 
code:

{code:title=borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:


|  x|AmtPaid|
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|


next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,


|  x|AmtPaid|AmtPaidCumSum|
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|

next, compute cumulative max,

{code}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log

{noformat}
 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:


Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
{noformat}

but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}


|  x|AmtPaid|AmtPaidCumSum|MaxBound|
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']
    return output

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll,
                                sparktypes.FloatType())
df = df.withColumn('AmtPaidAdjusted',
                   udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns])))
df.show()
{code}

gives,

{noformat}
Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, 

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some example 
code:

{code:title=borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:


|  x|AmtPaid|
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|


next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,


|  x|AmtPaid|AmtPaidCumSum|
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|

next, compute cumulative max,

{code}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log

{noformat}
 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
{noformat}

but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}


|  x|AmtPaid|AmtPaidCumSum|MaxBound|
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']
    return output

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll,
                                sparktypes.FloatType())
df = df.withColumn('AmtPaidAdjusted',
                   udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns])))
df.show()
{code}

gives,

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
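
As an aside, the struct-plus-UDF step above can also be written with built-in 
column expressions; a minimal sketch (same column names as above; whether it 
avoids the error reported here is untested):

{code}
import pyspark.sql.functions as sf

# same condition as _udf_compare_cumsum_sll, expressed as a Column
df = df.withColumn(
    'AmtPaidAdjusted',
    sf.when(sf.col('AmtPaidCumSumMax') >= sf.col('MaxBound'), sf.lit(0.0))
      .otherwise(sf.col('AmtPaid')))
df.show()
{code}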
  

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

{code:borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:


|  x|AmtPaid|
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|


next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,


|  x|AmtPaid|AmtPaidCumSum|
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|

next, compute cumulative max,

{code}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}


|  x|AmtPaid|AmtPaidCumSum|MaxBound|
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']
    return output

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll,
                                sparktypes.FloatType())
df = df.withColumn('AmtPaidAdjusted',
                   udf_compare_cumsum_sll(sf.struct([df[x] for x in df.columns])))
df.show()
{code}

gives,

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
  

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

{code:borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:


|  x|AmtPaid|
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|


next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,

+---+---+-+
|  x|AmtPaid|AmtPaidCumSum|
+---+---+-+
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|
+---+---+-+ 

next, compute cumulative max,

{code:borderStyle=solid}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}

+---+---+-++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+---+-++
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|
+---+---+-++

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

+---+---+-+++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+---+-+++
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|
+---+---+-+++   

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']


[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

{code:borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:

+---+---+
|  x|AmtPaid|
+---+---+
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+---+

next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,

+---+---+-+
|  x|AmtPaid|AmtPaidCumSum|
+---+---+-+
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|
+---+---+-+ 

next, compute cumulative max,

{code:borderStyle=solid}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

{code}
df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
{code}

+---+---+-++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+---+-++
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|
+---+---+-++

{code}
#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()
{code}

+---+---+-+++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+---+-+++
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|
+---+---+-+++   

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

{code}
def _udf_compare_cumsum_sll(x):
if x['AmtPaidCumSumMax'] >= x['MaxBound']:
output = 0
else:

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H edited comment on SPARK-12216 at 3/24/17 9:37 PM:
--

I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextFile in the script.

Example scenarios:

* ERROR when I include --jars AND use saveAsTextFile 
* Works when I use saveAsTextFile but don't pass any --jars on the command line 
* Works when I include --jars on the command line but don't call saveAsTextFile 
(i.e. comment it out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

After the error is thrown and and spark-submit has ended, I take a look at the 
folder that couldn't be deleted, it has the .jar file inside, for example 
{{C:\Users\Jouni\AppData\Local\Temp\spark-9b68fc91-7ee7-481a-970d-38a6db6f6160\userFiles-948dc876-bced-4778-98a7-90944a7fb155\aws-java-sdk-1.7.4.jar}}



was (Author: jouni):
I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextFile in the script.

Example scenarios:

* ERROR when I include --jars AND use saveAsTextFile 
* Works when I use saveAsTextFile but don't pass any --jars on the command line 
* Works when I include --jars on the command line but don't call saveAsTextFile 
(i.e. comment it out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at 

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

{code:title=test.py|borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:

+---+---+
|  x|AmtPaid|
+---+---+
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+---+

next, compute cumulative sum

{code:title=test.py|borderStyle=solid}
win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()
{code}

gives,

+---+---+-+
|  x|AmtPaid|AmtPaidCumSum|
+---+---+-+
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|
+---+---+-+ 

next, compute cumulative max,

{code:borderStyle=solid}
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()
{code}

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
+---+---+-++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+---+-++
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|
+---+---+-++


#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()

+---+---+-+++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+---+-+++
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|
+---+---+-+++   

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

def _udf_compare_cumsum_sll(x):
if x['AmtPaidCumSumMax'] >= x['MaxBound']:
output = 0
else:
output = 

[jira] [Updated] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mandar uapdhye updated SPARK-20086:
---
Description: 
original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

{code:title=test.py|borderStyle=solid}
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

{code}

gives:

+---+---+
|  x|AmtPaid|
+---+---+
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+---+

next, compute cumulative sum

win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()

gives,

+---+---+-+
|  x|AmtPaid|AmtPaidCumSum|
+---+---+-+
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|
+---+---+-+ 

next, compute cumulative max,

df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
+---+---+-++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+---+-++
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|
+---+---+-++


#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()

+---+---+-+++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+---+-+++
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|
+---+---+-+++   

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

def _udf_compare_cumsum_sll(x):
    if x['AmtPaidCumSumMax'] >= x['MaxBound']:
        output = 0
    else:
        output = x['AmtPaid']

udf_compare_cumsum_sll = sf.udf(_udf_compare_cumsum_sll, 

[jira] [Created] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-24 Thread mandar uapdhye (JIRA)
mandar uapdhye created SPARK-20086:
--

 Summary: issue with pyspark 2.1.0 window function
 Key: SPARK-20086
 URL: https://issues.apache.org/jira/browse/SPARK-20086
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: mandar uapdhye


original  post at
[stackoverflow | 
http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]

I get an error when working with the pyspark window function. Here is some 
example code:

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
from pyspark.sql import window

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
df.show()

gives:

+---+---+
|  x|AmtPaid|
+---+---+
|  1|2.0|
|  1|3.0|
|  1|1.0|
|  1|   -2.0|
|  1|   -1.0|
+---+---+

next, compute cumulative sum

win_spec_max = (window.Window
                .partitionBy(['x'])
                .rowsBetween(window.Window.unboundedPreceding, 0))
df = df.withColumn('AmtPaidCumSum',
                   sf.sum(sf.col('AmtPaid')).over(win_spec_max))
df.show()

gives,

+---+---+-+
|  x|AmtPaid|AmtPaidCumSum|
+---+---+-+
|  1|2.0|  2.0|
|  1|3.0|  5.0|
|  1|1.0|  6.0|
|  1|   -2.0|  4.0|
|  1|   -1.0|  3.0|
+---+---+-+ 

next, compute cumulative max,

df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))

df.show()

gives error log


 Py4JJavaError: An error occurred while calling o2609.showString.


with traceback:

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 df.show()

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
show(self, n, truncate)
316 """
317 if isinstance(truncate, bool) and truncate:
--> 318 print(self._jdf.showString(n, 20))
319 else:
320 print(self._jdf.showString(n, int(truncate)))

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(


but interestingly enough, if I introduce another change before the second window 
operation, say inserting a column, then it does not give that error:

df = df.withColumn('MaxBound', sf.lit(6.))
df.show()
+---+---+-++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|
+---+---+-++
|  1|2.0|  2.0| 6.0|
|  1|3.0|  5.0| 6.0|
|  1|1.0|  6.0| 6.0|
|  1|   -2.0|  4.0| 6.0|
|  1|   -1.0|  3.0| 6.0|
+---+---+-++


#then apply the second window operations
df = df.withColumn('AmtPaidCumSumMax', 
sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
df.show()

+---+---+-+++
|  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
+---+---+-+++
|  1|2.0|  2.0| 6.0| 2.0|
|  1|3.0|  5.0| 6.0| 5.0|
|  1|1.0|  6.0| 6.0| 6.0|
|  1|   -2.0|  4.0| 6.0| 6.0|
|  1|   -1.0|  3.0| 6.0| 6.0|
+---+---+-+++   

I do not understand this behaviour

Well, so far so good, but then I try another operation and again get a similar 
error:

def _udf_compare_cumsum_sll(x):
if x['AmtPaidCumSumMax'] >= x['MaxBound']:

[jira] [Assigned] (SPARK-20075) Support classifier, packaging in Maven coordinates

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20075:


Assignee: Sean Owen  (was: Apache Spark)

> Support classifier, packaging in Maven coordinates
> --
>
> Key: SPARK-20075
> URL: https://issues.apache.org/jira/browse/SPARK-20075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Currently, it's possible to add dependencies to an app using its Maven 
> coordinates on the command line: {{group:artifact:version}}. However, really 
> Maven coordinates are 5-dimensional: 
> {{group:artifact:packaging:classifier:version}}. In some rare but real cases 
> it's important to be able to specify the classifier. And while we're at it 
> why not try to support packaging?
> I have a WIP PR that I'll post soon.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941155#comment-15941155
 ] 

Apache Spark commented on SPARK-20075:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17416

> Support classifier, packaging in Maven coordinates
> --
>
> Key: SPARK-20075
> URL: https://issues.apache.org/jira/browse/SPARK-20075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Currently, it's possible to add dependencies to an app using its Maven 
> coordinates on the command line: {{group:artifact:version}}. However, really 
> Maven coordinates are 5-dimensional: 
> {{group:artifact:packaging:classifier:version}}. In some rare but real cases 
> it's important to be able to specify the classifier. And while we're at it 
> why not try to support packaging?
> I have a WIP PR that I'll post soon.
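
For illustration only (the artifact coordinate below is hypothetical, and the 
five-part form depends on the pending PR): today a dependency is added with the 
three-part coordinate, e.g. via the spark.jars.packages configuration, while the 
proposal would allow the full group:artifact:packaging:classifier:version form.

{code}
from pyspark.sql import SparkSession

# three-part coordinate (group:artifact:version); the artifact name is made up
spark = (SparkSession.builder
         .config("spark.jars.packages", "com.example:some-lib_2.11:1.0.0")
         .getOrCreate())

# proposed five-part form, e.g. "com.example:some-lib_2.11:jar:tests:1.0.0",
# would only be meaningful once the change discussed here lands
{code}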



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20075) Support classifier, packaging in Maven coordinates

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20075:


Assignee: Apache Spark  (was: Sean Owen)

> Support classifier, packaging in Maven coordinates
> --
>
> Key: SPARK-20075
> URL: https://issues.apache.org/jira/browse/SPARK-20075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, it's possible to add dependencies to an app using its Maven 
> coordinates on the command line: {{group:artifact:version}}. However, really 
> Maven coordinates are 5-dimensional: 
> {{group:artifact:packaging:classifier:version}}. In some rare but real cases 
> it's important to be able to specify the classifier. And while we're at it 
> why not try to support packaging?
> I have a WIP PR that I'll post soon.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2017-03-24 Thread Leif Warner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941138#comment-15941138
 ] 

Leif Warner commented on SPARK-18484:
-

In the example on the ticket, the scale of the input BigDecimals is 2. Spark 
discards that information when it creates the tables and sets the scale to 18 in 
the schema. When it converts from table form back to BigDecimal, it respects the 
precision and scale specified in the schema in the created BigDecimals.
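
A PySpark sketch of the same point, declaring the decimal type up front rather 
than relying on inference (assumes an existing {{spark}} session; names mirror 
the ticket's example):

{code}
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField("id", StringType()),
    StructField("money", DecimalType(10, 2)),  # precision/scale chosen explicitly
])
df = spark.createDataFrame(
    [("1", Decimal("22.50")), ("2", Decimal("500.66"))], schema)
df.printSchema()  # money: decimal(10,2) instead of the inferred decimal(38,18)
{code}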

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>
> Currently when using decimal type (BigDecimal in scala case class) there's no 
> way to enforce precision and scale. This is quite critical when saving data - 
> regarding space usage and compatibility with external systems (for example 
> Hive table) because spark saves data as Decimal(38,18)
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> Workaround is to convert dataset to dataframe before saving and manually cast 
> to specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H edited comment on SPARK-12216 at 3/24/17 8:59 PM:
--

I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

aws-java-sdk-1.7.4.jar can be downloaded from here: 
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html


was (Author: jouni):
I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars for spark-submit AND use 
saveAsTextOut on the script.

Example scenarios:

* ERROR when include --jars AND use saveAsTextFile 
* Works when use saveAsTextFile, but don't use any --jars on command line 
* Works when you include --jars on command line but don't use saveAsTextOut 
(comment out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():

# Initialize the spark context.
spark = SparkSession\
.builder\
.appName("SparkParseLogTest")\
.getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> 

[jira] [Commented] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941117#comment-15941117
 ] 

Apache Spark commented on SPARK-19408:
--

User 'ron8hu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17415

> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>
> In SPARK-17075, we estimate cardinality of predicate expression "column (op) 
> literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns such as "column-1 (op) column-2" 
> where column-1 and column-2 belong to same table.  Note that, if column-1 and 
> column-2 belong to different tables, then it is a join operator's work, NOT a 
> filter operator's work.
> In this jira, we want to estimate the filter factor of predicate expressions 
> involving two columns of same table.   
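
For example (hypothetical table and column names), the predicate shape covered 
here is a filter comparing two columns of one table:

{code}
# both columns come from the same table, so this stays a Filter, not a Join
spark.sql("SELECT * FROM orders WHERE ship_date > order_date").explain()
{code}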



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941116#comment-15941116
 ] 

DB Tsai commented on SPARK-17137:
-

[~sethah] We can do the performance test during the QA time. Thanks.

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19408:


Assignee: Apache Spark

> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>Assignee: Apache Spark
>
> In SPARK-17075, we estimate cardinality of predicate expression "column (op) 
> literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns such as "column-1 (op) column-2" 
> where column-1 and column-2 belong to same table.  Note that, if column-1 and 
> column-2 belong to different tables, then it is a join operator's work, NOT a 
> filter operator's work.
> In this jira, we want to estimate the filter factor of predicate expressions 
> involving two columns of same table.   



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-17137:
---

Assignee: Seth Hendrickson

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19408) cardinality estimation involving two columns of the same table

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19408:


Assignee: (was: Apache Spark)

> cardinality estimation involving two columns of the same table
> --
>
> Key: SPARK-19408
> URL: https://issues.apache.org/jira/browse/SPARK-19408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Ron Hu
>
> In SPARK-17075, we estimate cardinality of predicate expression "column (op) 
> literal", where op is =, <, <=, >, or >=.  In SQL queries, we also see 
> predicate expressions involving two columns such as "column-1 (op) column-2" 
> where column-1 and column-2 belong to same table.  Note that, if column-1 and 
> column-2 belong to different tables, then it is a join operator's work, NOT a 
> filter operator's work.
> In this jira, we want to estimate the filter factor of predicate expressions 
> involving two columns of same table.   



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941114#comment-15941114
 ] 

Seth Hendrickson commented on SPARK-17137:
--

I can make a PR for using this inside the MLOR code, but I probably won't have 
time to do performance tests within the next couple of days (since code freeze 
has already passed). [~dbtsai] Do you think we need to do performance tests 
before this patch goes in? 

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2017-03-24 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941106#comment-15941106
 ] 

DB Tsai commented on SPARK-17137:
-

Ping! SPARK-17471 is merged. Anyone interested in making this by 2.2.0 release?

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941104#comment-15941104
 ] 

Sean Owen commented on SPARK-19476:
---

I don't think you're expected to be able to do this safely in general. Why do 
this asynchronously at all, rather than simply using more partitions?
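
For comparison, a minimal sketch of the synchronous pattern in PySpark 
(materialize the partition on the task thread before any other thread touches 
the data; {{df}} stands for the DataFrame in the snippet below):

{code}
def process_partition(rows):
    # consume the partition iterator on the Spark task thread; iterators such as
    # TungstenAggregationIterator are not safe to read from another thread
    materialized = list(rows)
    for row in materialized:
        pass  # real work goes here; the list could now be handed to other threads

df.foreachPartition(process_partition)
{code}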

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17471) Add compressed method for Matrix class

2017-03-24 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-17471.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15628
[https://github.com/apache/spark/pull/15628]

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.2.0
>
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-24 Thread Lucy Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941079#comment-15941079
 ] 

Lucy Yu commented on SPARK-19476:
-

I believe we've seen a similar issue raised here:
https://github.com/memsql/memsql-spark-connector/issues/31

It seems that

{code}
sqlDf.foreachPartition(partition => {
  new Thread(new Runnable {
override def run(): Unit = {
  for (row <- partition) {
// do nothing here, just force the partition to be fully iterated over
  }
}
  }).start()
})
{code}

results in

{code}
java.lang.NullPointerException
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:287)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:86)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(UnknownSource)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextPartition(WindowExec.scala:361)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:391)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:290)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(UnknownSource)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
com.cainc.data.etl.lake.TestMartOutput$$anonfun$main$3$$anon$1.run(TestMartOutput.scala:42)
at java.lang.Thread.run(Thread.java:745)
{code}

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> 

[jira] [Assigned] (SPARK-20085) Configurable mesos labels for executors

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20085:


Assignee: (was: Apache Spark)

> Configurable mesos labels for executors
> ---
>
> Key: SPARK-20085
> URL: https://issues.apache.org/jira/browse/SPARK-20085
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Priority: Minor
>
> Add a spark.mesos.task.labels configuration option to attach Mesos key:value 
> labels to the executor.
> The format is "k1:v1,k2:v2": a colon separates each key from its value, and 
> commas separate multiple labels.
> See discussion on label at https://github.com/apache/spark/pull/17404



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20085) Configurable mesos labels for executors

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20085:


Assignee: Apache Spark

> Configurable mesos labels for executors
> ---
>
> Key: SPARK-20085
> URL: https://issues.apache.org/jira/browse/SPARK-20085
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Assignee: Apache Spark
>Priority: Minor
>
> Add a spark.mesos.task.labels configuration option to attach Mesos key:value 
> labels to the executor.
> The format is "k1:v1,k2:v2": a colon separates each key from its value, and 
> commas separate multiple labels.
> See discussion on label at https://github.com/apache/spark/pull/17404



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20085) Configurable mesos labels for executors

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941065#comment-15941065
 ] 

Apache Spark commented on SPARK-20085:
--

User 'kalvinnchau' has created a pull request for this issue:
https://github.com/apache/spark/pull/17413

> Configurable mesos labels for executors
> ---
>
> Key: SPARK-20085
> URL: https://issues.apache.org/jira/browse/SPARK-20085
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Priority: Minor
>
> Add a spark.mesos.task.labels configuration option to attach Mesos key:value 
> labels to the executor.
> The format is "k1:v1,k2:v2": a colon separates each key from its value, and 
> commas separate multiple labels.
> See discussion on label at https://github.com/apache/spark/pull/17404



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20085) Configurable mesos labels for executors

2017-03-24 Thread Kalvin Chau (JIRA)
Kalvin Chau created SPARK-20085:
---

 Summary: Configurable mesos labels for executors
 Key: SPARK-20085
 URL: https://issues.apache.org/jira/browse/SPARK-20085
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Kalvin Chau
Priority: Minor


Add a spark.mesos.task.labels configuration option to attach Mesos key:value 
labels to the executor.

The format is "k1:v1,k2:v2": a colon separates each key from its value, and 
commas separate multiple labels.

See discussion on label at https://github.com/apache/spark/pull/17404
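A hypothetical usage sketch of the proposed option; the option name and label format are taken from this issue, and the option does not exist in a released Spark version at this point.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mesos-labels-example")             // illustrative application name
  .set("spark.mesos.task.labels", "k1:v1,k2:v2")  // proposed key:value label list
{code}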



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-24 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19911.
-
  Resolution: Fixed
Assignee: Adam Budde
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Resolved with https://github.com/apache/spark/pull/17250

> Add builder interface for Kinesis DStreams
> --
>
> Key: SPARK-19911
> URL: https://issues.apache.org/jira/browse/SPARK-19911
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
>Priority: Minor
> Fix For: 2.2.0
>
>
> The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
> DStreams is quite brittle and requires adding a combinatorial number of 
> overrides whenever another optional configuration parameter is added. This 
> makes incorporating a lot of additional features supported by the Kinesis 
> Client Library such as per-service authorization unfeasible. This interface 
> should be replaced by a builder pattern class 
> (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
> extensibility.
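A rough sketch of the builder shape the description asks for; every name below is an assumption for illustration, not the API that was eventually merged.

{code}
// Hypothetical builder: each optional setting becomes a chained method instead
// of another createStream() overload.
case class KinesisStreamSettings(
    streamName: String = "",
    endpointUrl: String = "https://kinesis.us-east-1.amazonaws.com",
    checkpointAppName: String = "")

class KinesisStreamBuilder(settings: KinesisStreamSettings = KinesisStreamSettings()) {
  def streamName(name: String) = new KinesisStreamBuilder(settings.copy(streamName = name))
  def endpointUrl(url: String) = new KinesisStreamBuilder(settings.copy(endpointUrl = url))
  def checkpointAppName(app: String) = new KinesisStreamBuilder(settings.copy(checkpointAppName = app))
  // In the real feature this would create the DStream; here it just returns the settings.
  def build(): KinesisStreamSettings = settings
}

// Adding a new optional parameter later only adds one method, not N overloads.
val settings = new KinesisStreamBuilder()
  .streamName("events")
  .checkpointAppName("my-app")
  .build()
{code}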



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files

2017-03-24 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-20084:
-

 Summary: Remove internal.metrics.updatedBlockStatuses accumulator 
from history files
 Key: SPARK-20084
 URL: https://issues.apache.org/jira/browse/SPARK-20084
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 2.1.0
Reporter: Ryan Blue


History files for large jobs can be hundreds of GB. These history files take 
too much space and create a backlog on the history server.

Most of the size is from Accumulables in SparkListenerTaskEnd. The largest 
accumulable is internal.metrics.updatedBlockStatuses, which has a small update 
(the blocks that were changed) but a huge value (all known blocks). Nothing 
currently uses the accumulator value or update, so it is safe to remove it. 
Information for any block updated during a task is also recorded under Task 
Metrics / Updated Blocks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20084:


Assignee: Apache Spark

> Remove internal.metrics.updatedBlockStatuses accumulator from history files
> ---
>
> Key: SPARK-20084
> URL: https://issues.apache.org/jira/browse/SPARK-20084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> History files for large jobs can be hundreds of GB. These history files take 
> too much space and create a backlog on the history server.
> Most of the size is from Accumulables in SparkListenerTaskEnd. The largest 
> accumulable is internal.metrics.updatedBlockStatuses, which has a small 
> update (the blocks that were changed) but a huge value (all known blocks). 
> Nothing currently uses the accumulator value or update, so it is safe to 
> remove it. Information for any block updated during a task is also recorded 
> under Task Metrics / Updated Blocks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files

2017-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941012#comment-15941012
 ] 

Apache Spark commented on SPARK-20084:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/17412

> Remove internal.metrics.updatedBlockStatuses accumulator from history files
> ---
>
> Key: SPARK-20084
> URL: https://issues.apache.org/jira/browse/SPARK-20084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Ryan Blue
>
> History files for large jobs can be hundreds of GB. These history files take 
> too much space and create a backlog on the history server.
> Most of the size is from Accumulables in SparkListenerTaskEnd. The largest 
> accumulable is internal.metrics.updatedBlockStatuses, which has a small 
> update (the blocks that were changed) but a huge value (all known blocks). 
> Nothing currently uses the accumulator value or update, so it is safe to 
> remove it. Information for any block updated during a task is also recorded 
> under Task Metrics / Updated Blocks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20084) Remove internal.metrics.updatedBlockStatuses accumulator from history files

2017-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20084:


Assignee: (was: Apache Spark)

> Remove internal.metrics.updatedBlockStatuses accumulator from history files
> ---
>
> Key: SPARK-20084
> URL: https://issues.apache.org/jira/browse/SPARK-20084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Ryan Blue
>
> History files for large jobs can be hundreds of GB. These history files take 
> too much space and create a backlog on the history server.
> Most of the size is from Accumulables in SparkListenerTaskEnd. The largest 
> accumulable is internal.metrics.updatedBlockStatuses, which has a small 
> update (the blocks that were changed) but a huge value (all known blocks). 
> Nothing currently uses the accumulator value or update, so it is safe to 
> remove it. Information for any block updated during a task is also recorded 
> under Task Metrics / Updated Blocks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2017-03-24 Thread Jouni H (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940987#comment-15940987
 ] 

Jouni H commented on SPARK-12216:
-

I was able to reproduce this bug on Windows with the latest spark version: 
spark-2.1.0-bin-hadoop2.7

This bug happens for me when I include --jars with spark-submit AND use 
saveAsTextFile in the script.

Example scenarios:

* ERROR when I include --jars AND use saveAsTextFile
* Works when I use saveAsTextFile but don't pass any --jars on the command line
* Works when I include --jars on the command line but don't call saveAsTextFile 
(commented out)

Example command line: {{spark-submit --jars aws-java-sdk-1.7.4.jar 
sparkbugtest.py bugtest.txt ./output/test1/}}

The script here doesn't need the --jars file, but if you include it on the 
command line, it causes the shutdown bug.

The input in the bugtest.txt doesn't matter.

Example script:

{noformat}
import sys

from pyspark.sql import SparkSession

def main():
    # Initialize the Spark session.
    spark = SparkSession \
        .builder \
        .appName("SparkParseLogTest") \
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    lines.saveAsTextFile(sys.argv[2])

if __name__ == "__main__":
    main()

{noformat}

I also use winutils.exe as mentioned here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940960#comment-15940960
 ] 

yuhao yang commented on SPARK-20082:


Yes, that's one of the things that we should improve for LDA.
If you're interested in working on the issue, could you please first share a 
rough design, given the complexity of supporting both the EM and Online 
optimizers and their models?

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.
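A hypothetical API sketch of what the proposal could look like, loosely mirroring the setInitialModel pattern used elsewhere in MLlib; setInitialModel is an assumed name here, not an existing LDA parameter, and firstBatch/newDocuments stand for any DataFrames of documents.

{code}
import org.apache.spark.ml.clustering.LDA

// Train on the first batch of documents as usual.
val firstModel = new LDA().setK(20).setOptimizer("online").fit(firstBatch)

// Proposed: continue training from the previous model on new documents.
val updatedModel = new LDA()
  .setK(20)
  .setOptimizer("online")
  .setInitialModel(firstModel)   // hypothetical parameter proposed by this issue
  .fit(newDocuments)
{code}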



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD

2017-03-24 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940897#comment-15940897
 ] 

gagan taneja commented on SPARK-19705:
--

Sandy,
Would you be able to help with this bug?

I saw you had filed an earlier issue for HDFS cache support in RDDs, and this is a 
relatively minor change. We have been running this code in production and have 
achieved a major performance boost.
Below is the reference to the earlier issue you filed: 
https://issues.apache.org/jira/browse/SPARK-1767

> Preferred location supporting HDFS Cache for FileScanRDD
> 
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache while calculating 
> preferredLocations, FileScanRDD does not take the HDFS cache into account when 
> calculating preferredLocations.
> The enhancement can be implemented easily for large files, where a FilePartition 
> contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS 
> partitions.
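A minimal sketch of the kind of preferred-location logic being described, written against the public Hadoop FileSystem API (getCachedHosts needs Hadoop 2.3+); this is an illustration under those assumptions, not the actual FileScanRDD change.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Prefer hosts holding an HDFS-cached replica of the block range; fall back to
// the ordinary replica hosts when nothing is cached.
def preferredHosts(fs: FileSystem, file: Path, start: Long, len: Long): Seq[String] = {
  val status = fs.getFileStatus(file)
  fs.getFileBlockLocations(status, start, len).flatMap { block =>
    val cached = block.getCachedHosts
    if (cached.nonEmpty) cached else block.getHosts
  }.distinct.toSeq
}
{code}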



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-03-24 Thread Ian Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940796#comment-15940796
 ] 

Ian Maloney commented on SPARK-12664:
-

Does anyone know which versions this will make it into? Also, when it might 
make it into a release?

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-24 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-20083:


 Summary: Change matrix toArray to not create a new array when 
matrix is already column major
 Key: SPARK-20083
 URL: https://issues.apache.org/jira/browse/SPARK-20083
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Seth Hendrickson
Priority: Minor


{{toArray}} always creates a new array in column major format, even when the 
resulting array is the same as the backing values. We should change this to 
just return a reference to the values array when it is already column major.
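A minimal sketch of the proposed behavior for the dense case, written against the public ml.linalg API; the real change would live inside the Matrix implementations themselves.

{code}
import org.apache.spark.ml.linalg.DenseMatrix

// Return the backing array directly when the matrix is already column major
// (i.e. not transposed); otherwise materialize a column-major copy as today.
def toColumnMajorArray(m: DenseMatrix): Array[Double] =
  if (!m.isTransposed) {
    m.values
  } else {
    val out = new Array[Double](m.numRows * m.numCols)
    var j = 0
    while (j < m.numCols) {
      var i = 0
      while (i < m.numRows) {
        out(j * m.numRows + i) = m(i, j)
        i += 1
      }
      j += 1
    }
    out
  }
{code}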



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20082:
--
Component/s: (was: MLlib)
 ML

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20082:
--
Issue Type: New Feature  (was: Wish)

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2017-03-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940672#comment-15940672
 ] 

Stéphane Collot edited comment on SPARK-17557 at 3/24/17 4:37 PM:
--

Hi,

I got a similar issue with Spark 2.0.0, and it appears that I had a corrupted 
partitioned table: a column had different types in different partitions. It 
happened because I was writing partitions manually into subfolders.
So reading a specific partition was working, but reading the entire table was 
giving this PlainValuesDictionary exception on a df.show() or on 
df.sort('column').show()

So it was my fault, but Spark could check if the schema in each partition is 
the same.

Cheers,
Stéphane


was (Author: scollot):
Hi,

I got a similar issue with Spark 2.0.0 and it appears that I had partitioned 
table, and I had some different columns types inside some partitions because I 
was writing those manually inside the subfolders of the partitions. So reading 
a specific partition was working, but reading the entire table was giving this 
PlainValuesDictionary exception on a df.show() or on df.sort('column').show()
So it was my fault, but Spark could check if the schema in each partition is 
the same.

Cheers,
Stéphane

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
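Given the diagnosis above (per-partition schema drift), a small sketch of one way to surface the mismatch up front; mergeSchema is a standard Parquet read option, and the path is a placeholder.

{code}
// Asking Spark to reconcile per-partition Parquet schemas at read time makes
// incompatible column types fail fast with an explicit merge error instead of
// the dictionary-decoding exception shown above.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/partitioned/table")   // placeholder path
{code}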



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2017-03-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940672#comment-15940672
 ] 

Stéphane Collot commented on SPARK-17557:
-

Hi,

I got a similar issue with Spark 2.0.0 and it appears that I had partitioned 
table, and I had some different columns types inside some partitions because I 
was writing those manually inside the subfolders of the partitions. So reading 
a specific partition was working, but reading the entire table was giving this 
PlainValuesDictionary exception on a df.show() or on df.sort('column').show()
So it was my fault, but Spark could check if the schema in each partition is 
the same.

Cheers,
Stéphane

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)

2017-03-24 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940654#comment-15940654
 ] 

Don Drake commented on SPARK-16745:
---

I just came across the same exception running Spark 2.1.0 on my Mac, running a 
few spark-shell commands on a tiny dataset that previously worked just fine.  
But, I never got a result, just the timeout exceptions.

The issue is that today I'm running them on a corporate VPN with proxy settings 
enabled, and the IP address the driver is using is my local (Wi-Fi) address, 
which the proxy server cannot connect to.

This took a while to figure out, but I added {{--conf 
spark.driver.host=127.0.0.1}} to my command line. That forced all networking 
between the driver and executors to stay local (bypassing the proxy server), and 
the query came back in the expected amount of time.

> Spark job completed however have to wait for 13 mins (data size is small)
> -
>
> Key: SPARK-16745
> URL: https://issues.apache.org/jira/browse/SPARK-16745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014
>Reporter: Joe Chong
>Priority: Minor
>
> I submitted a job in the Scala spark-shell to show a DataFrame. The data size is 
> about 43K. The job was successful in the end, but took more than 13 minutes 
> to complete. Upon checking the log, there are multiple exceptions raised on 
> "Failed to check existence of class" with a java.net.ConnectException 
> message indicating a timeout trying to connect to port 52067, the REPL port 
> that Spark set up. Please assist with troubleshooting. Thanks.
> Started Spark in standalone mode
> $ spark-shell --driver-memory 5g --master local[*]
> 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
> 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:30 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:52067
> 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 52067.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 52072.
> 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/07/26 21:05:35 INFO Remoting: Starting remoting
> 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52074.
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at 
> /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
> 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 
> 3.4 GB
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:36 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at 
> http://10.199.29.218:4040
> 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host 
> localhost
> 16/07/26 21:05:36 INFO 

[jira] [Commented] (SPARK-7200) Tungsten test suites should fail if memory leak is detected

2017-03-24 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940604#comment-15940604
 ] 

Jose Soltren commented on SPARK-7200:
-

[~joshrosen] please have a look at my PR. Is this what you had in mind? Was 
there a different approach you were considering? Did you have any other tests 
in mind?

I imagine the code has changed some since April of 2015. I didn't see too many 
methods exposed by @VisibleForTesting.

> Tungsten test suites should fail if memory leak is detected
> ---
>
> Key: SPARK-7200
> URL: https://issues.apache.org/jira/browse/SPARK-7200
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Reporter: Reynold Xin
>
> We should be able to detect whether there are unreturned memory after each 
> suite and fail the suite.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-24 Thread Everett Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940596#comment-15940596
 ] 

Everett Anderson commented on SPARK-20073:
--

Doesn't the join criterion convey that the user doesn't want a Cartesian 
product, though?

Ideally, the fix would be to not perform a Cartesian product when two tables have 
a join criterion that isn't the constant 'true'.

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay, let's try to construct the exact same variantCount table manually
> // so it has no relationship to the original.
> val variantCountRowsRDD = sc.parallelize(Seq(
> Row("Fred", 2),
> Row(null, 2),
> Row("Amy", 2)))
> 
> val 

[jira] [Assigned] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15040:
--

Assignee: Nick Pentreath

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> PySpark impl for ml.feature.Imputer.
> This need to wait until PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15040.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17316
[https://github.com/apache/spark/pull/17316]

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> PySpark impl for ml.feature.Imputer.
> This need to wait until PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Steps to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there 
or capture it as a field via the closure.

Result:
No exception is thrown, and the foreachPartition action is not executed.
Expected result:
A java.io.NotSerializableException is thrown.

Description:
When I try to use or initialize org.slf4j.Logger inside foreachPartition, where 
that code is extracted into a trait method called from foreachRDD, I have found 
that the foreachPartition body does not execute and no exception appears.
Tested with Spark in local and YARN mode.

The code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main classes that illustrate the problem.

If I run the same code as a batch job, I get this exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}

On GitHub there are two commits: the initial one I linked above (it contains the 
streaming example), and the last one with the batch job example.
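A minimal sketch of a common way to keep the logger out of the serialized closure (an assumption about a workaround, not this issue's resolution; dstream stands for any DStream):

{code}
import org.slf4j.LoggerFactory

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Created on the executor inside the closure, so no outer non-serializable
    // object (trait instance or logger field) has to be captured and serialized.
    val log = LoggerFactory.getLogger("partition-logger")
    records.foreach(r => log.info(s"processing $r"))
  }
}
{code}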

  was:
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. foreachPartition action not executed.
Expected result:
Throw java.io.NotSerializableException

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get right exception.
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at 

[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. foreachPartition action not executed.
Expected result:
Throw java.io.NotSerializableException

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get right exception.
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}

On Github can be found 2 commit. 1 initial i add link on it(this one contain 
sptreaming example). and Last one with batch job example

  was:
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. foreachPartition action not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get right exception.
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at 

[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. foreachPartition action not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get right exception.
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}

On Github can be found 2 commit. 1 initial i add link on it(this one contain 
sptreaming example). and Last one with batch job example

  was:
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. Foraech action not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get right exception.
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at 

[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Step to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there, or capture it as a field through the closure.

Result:
No exception is thrown and the foreach action is not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

Description:
When I use or initialize org.slf4j.Logger inside foreachPartition, in code extracted to a trait method that is called from foreachRDD, the foreachPartition body does not execute and no exception appears.
Tested with Spark in local and YARN mode.

The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem.

If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}

On GitHub there are two commits: the initial one, which the link above points to (it contains the streaming example), and a later one with the batch-job example.
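
A common way to avoid this kind of NotSerializableException is to obtain the logger inside the partition closure, so the enclosing trait/object is never captured. A minimal sketch, assuming some {{rdd}} in scope (this is not code from the linked repository):
{code:language=scala}
import org.slf4j.LoggerFactory

rdd.foreachPartition { partition =>
  // The logger is created on the executor inside the closure,
  // so no non-serializable outer object (trait/companion) is captured.
  val log = LoggerFactory.getLogger("partition-logger")
  partition.foreach(record => log.info(record.toString))
}
{code}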

  was:
Step to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there, or capture it as a field through the closure.

Result:
No exception is thrown and the foreach action is not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

Description:
When I use or initialize org.slf4j.Logger inside foreachPartition, in code extracted to a trait method that is called from foreachRDD, the foreachPartition body does not execute and no exception appears.
Tested with Spark in local and YARN mode.

The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem.

If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at 

[jira] [Created] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-24 Thread Mathieu D (JIRA)
Mathieu D created SPARK-20082:
-

 Summary: Incremental update of LDA model, by adding initialModel 
as start point
 Key: SPARK-20082
 URL: https://issues.apache.org/jira/browse/SPARK-20082
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Mathieu D


Some mllib models support an initialModel to start from and update it 
incrementally with new data.

From what I understand of OnlineLDAOptimizer, it is possible to incrementally update an existing model with batches of new documents.

I suggest adding an initialModel as a starting point for LDA.
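
To make the wish concrete, here is a rough sketch of what such an API could look like on the spark.ml side; {{setInitialModel}} is the proposed, hypothetical parameter and does not exist today, the model path is a placeholder, and {{newDocuments}} stands for a DataFrame holding the new batch of documents:
{code:language=scala}
import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}

// Load a previously trained model (placeholder path).
val previous = LocalLDAModel.load("/models/lda-previous")

val lda = new LDA()
  .setK(previous.getK)
  .setOptimizer("online")
  // .setInitialModel(previous)   // hypothetical: start from the previous topic matrix

val updated = lda.fit(newDocuments) // today this still trains from a random start
{code}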



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8586) SQL add jar command does not work well with Scala REPL

2017-03-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8586.
--
Resolution: Duplicate

I think this and several other JIRAs are really the same issue: ADD JAR vs. --jars, and which classloader ends up with the classes.
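
For reference, a possible workaround (assuming the classloader analysis above is correct) is to put the SerDe jar on the classpath at launch instead of via ADD JAR, e.g.:
{code}
# untested sketch: ship the jar with --jars so the REPL/driver classloader sees it
./bin/spark-shell --jars sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
{code}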

> SQL add jar command does not work well with Scala REPL
> --
>
> Key: SPARK-8586
> URL: https://issues.apache.org/jira/browse/SPARK-8586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. 
> So, SerDe added through add jar command may not be loaded in the context 
> class loader when we lookup the table.
> For example, the following code will fail when we try to show the table. 
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", 
> "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20019) spark can not load alluxio fileSystem after adding jar

2017-03-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20019.
---
Resolution: Duplicate

> spark can not load alluxio fileSystem after adding jar
> --
>
> Key: SPARK-20019
> URL: https://issues.apache.org/jira/browse/SPARK-20019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: roncenzhao
> Attachments: exception_stack.png
>
>
> The follwing sql cannot load alluxioSystem and it throws 
> `ClassNotFoundException`.
> ```
> add jar /xxx/xxx/alluxioxxx.jar;
> set fs.alluxio.impl=alluxio.hadoop.FileSystem;
> select * from alluxionTbl;
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8586) SQL add jar command does not work well with Scala REPL

2017-03-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940185#comment-15940185
 ] 

Takeshi Yamamuro commented on SPARK-8586:
-

This issue still happens in v2.1.0 and the master.
{code}
scala> sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
scala> sql("drop table if exists jsonTable")
scala> sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe'")
scala> spark.range(100).selectExpr("CAST(id AS INT) AS key", "'xxx' AS 
val").write.insertInto("jsonTable")
scala> sql("SELECT * FROM jsonTable").show
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.hive.hcatalog.data.JsonSerDe
  at 
org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74)
  at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:114)
  at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.(HiveTableScanExec.scala:94)
  at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$11.apply(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$11.apply(HiveStrategies.scala:207)
  at 
org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:92)
  at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:203)
  at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
{code}

> SQL add jar command does not work well with Scala REPL
> --
>
> Key: SPARK-8586
> URL: https://issues.apache.org/jira/browse/SPARK-8586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. 
> So, SerDe added through add jar command may not be loaded in the context 
> class loader when we lookup the table.
> For example, the following code will fail when we try to show the table. 
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", 
> "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8586) SQL add jar command does not work well with Scala REPL

2017-03-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8586:

Affects Version/s: 2.1.0

> SQL add jar command does not work well with Scala REPL
> --
>
> Key: SPARK-8586
> URL: https://issues.apache.org/jira/browse/SPARK-8586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. 
> So, SerDe added through add jar command may not be loaded in the context 
> class loader when we lookup the table.
> For example, the following code will fail when we try to show the table. 
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", 
> "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels

2017-03-24 Thread Christian Reiniger (JIRA)
Christian Reiniger created SPARK-20081:
--

 Summary: RandomForestClassifier doesn't seem to support more than 
100 labels
 Key: SPARK-20081
 URL: https://issues.apache.org/jira/browse/SPARK-20081
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.0
 Environment: Java
Reporter: Christian Reiniger


When feeding data with more than 100 labels into RandomForestClassifier#fit() 
(from Java code), I get the following error message:

{code}
Classifier inferred 143 from label values in column rfc_df0e968db9df__labelCol, 
but this exceeded the max numClasses (100) allowed to be inferred from values.  
  To avoid this error for labels with > 100 classes, specify numClasses 
explicitly in the metadata; this can be done by applying StringIndexer to the 
label column.
{code}

Setting "numClasses" in the metadata for the label column doesn't make a 
difference. Looking at the code, this is not surprising, since 
MetadataUtils.getNumClasses() ignores this setting:

{code:language=scala}
  def getNumClasses(labelSchema: StructField): Option[Int] = {
Attribute.fromStructField(labelSchema) match {
  case binAttr: BinaryAttribute => Some(2)
  case nomAttr: NominalAttribute => nomAttr.getNumValues
  case _: NumericAttribute | UnresolvedAttribute => None
}
  }
{code}

The alternative would be to pass a proper "maxNumClasses" parameter to the 
classifier, so that Classifier#getNumClasses() allows a larger number of 
auto-detected labels. However, RandomForestClassifier#train() calls 
#getNumClasses without the "maxNumClasses" parameter, causing it to use the 
default of 100:

{code:language=scala}
  override protected def train(dataset: Dataset[_]): 
RandomForestClassificationModel = {
val categoricalFeatures: Map[Int, Int] =
  MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
val numClasses: Int = getNumClasses(dataset)
// ...
{code}

My Scala skills are pretty sketchy, so please correct me if I misinterpreted 
something. But as it stands, there seems to be no way to learn from data with 
more than 100 labels via RandomForestClassifier.
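
Based on the {{getNumClasses}} code quoted above, one workaround that may be worth trying (untested; {{df}} stands for the training DataFrame and 143 for the label count from the error message) is to attach {{NominalAttribute}} metadata to the label column, so the {{NominalAttribute}} case is hit instead of the value-inference path:
{code:language=scala}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Declare the number of distinct label values via ML attribute metadata.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(143)
  .toMetadata()

// Re-attach the label column with that metadata before calling fit().
val dfWithMeta = df.withColumn("label", col("label").as("label", labelMeta))
{code}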



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2017-03-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940130#comment-15940130
 ] 

Mohamed Baddar commented on SPARK-1548:
---

[~manishamde] [~sowen] [~josephkb] 
I have some experience with contributions to starter tasks in Spark, and I find 
this issue interesting. I have been looking into the partial implementation of 
RF and found these resources:

https://mahout.apache.org/users/classification/partial-implementation.html
https://github.com/apache/mahout/blob/b5fe4aab22e7867ae057a6cdb1610cfa17555311/mr/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/package-info.java

I think analyzing the Mahout implementation provides a good basis for studying 
a partial RF implementation, both in theory and in practice. If this issue is 
still important to Spark, it would be great if I could start on it, beginning 
with an analysis document on the current Mahout implementation to assess its 
performance.

> Add Partial Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1548
> URL: https://issues.apache.org/jira/browse/SPARK-1548
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>
> This task involves creating an alternate approximate random forest 
> implementation where each tree is constructed per partition.
> The tasks involves:
> - Justifying with theory and experimental results why this algorithm is a 
> good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Step to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, call rdd.foreachPartition.
3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there, or capture it as a field through the closure.

Result:
No exception is thrown and the foreach action is not executed.
Expected result:
Caused by: java.io.NotSerializableException: ReproduceBugMain$

Description:
When I use or initialize org.slf4j.Logger inside foreachPartition, in code extracted to a trait method that is called from foreachRDD, the foreachPartition body does not execute and no exception appears.
Tested with Spark in local and YARN mode.

The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem.

If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}

On GitHub there are two commits: the initial one, which the link above points to (it contains the streaming example), and a later one with the batch-job example.

  was:
When I use or initialize org.slf4j.Logger inside foreachPartition, in code extracted to a trait method that is called from foreachRDD, the foreachPartition body does not execute and no exception appears.
Tested with Spark in local and YARN mode.

The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem.

If I run the same code as a batch job, I get the expected exception:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at 

[jira] [Commented] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940113#comment-15940113
 ] 

Nick Hryhoriev commented on SPARK-20080:


It's a clear issue, and it comes with a link to the code. 
You have code to reproduce it, which is more than enough to understand the problem, 
even if my English is not good enough.

> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
>Reporter: Nick Hryhoriev
>Priority: Minor
>
> When i try use or init org.slf4j.Logger inside foreachPartition, that 
> extracted to trait method. 
> What was called in foreachRDD. 
> I have found that foreachPartition method do not execute and no exception 
> appeared.
> Tested on local and yarn mode spark.
> code can be found on 
> [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
>  There are two main class that explain problem.
> if i will run same code with batch job. I will get right exception.
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
> at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
> at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
> at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
> at ReproduceBugMain.main(ReproduceBugMain.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
> - object not serializable (class: ReproduceBugMain$, value: 
> ReproduceBugMain$@3935e9a8)
> - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
> $outer, type: interface TraitWithMethod)
> - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
> )
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
> ... 18 more
> {code}
> On Github can be found 2 commit. 1 initial i add link on it(this one contain 
> sptreaming example). and Last one with batch job example



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20080.
---
Resolution: Invalid

> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
>Reporter: Nick Hryhoriev
>Priority: Minor
>
> When i try use or init org.slf4j.Logger inside foreachPartition, that 
> extracted to trait method. 
> What was called in foreachRDD. 
> I have found that foreachPartition method do not execute and no exception 
> appeared.
> Tested on local and yarn mode spark.
> code can be found on 
> [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
>  There are two main class that explain problem.
> if i will run same code with batch job. I will get right exception.
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
> at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
> at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
> at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
> at ReproduceBugMain.main(ReproduceBugMain.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
> - object not serializable (class: ReproduceBugMain$, value: 
> ReproduceBugMain$@3935e9a8)
> - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
> $outer, type: interface TraitWithMethod)
> - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
> )
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
> ... 18 more
> {code}
> On Github can be found 2 commit. 1 initial i add link on it(this one contain 
> sptreaming example). and Last one with batch job example



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-20080:
---

> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
>Reporter: Nick Hryhoriev
>Priority: Minor
>
> When i try use or init org.slf4j.Logger inside foreachPartition, that 
> extracted to trait method. 
> What was called in foreachRDD. 
> I have found that foreachPartition method do not execute and no exception 
> appeared.
> Tested on local and yarn mode spark.
> code can be found on 
> [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
>  There are two main class that explain problem.
> if i will run same code with batch job. I will get right exception.
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
> at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
> at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
> at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
> at ReproduceBugMain.main(ReproduceBugMain.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
> - object not serializable (class: ReproduceBugMain$, value: 
> ReproduceBugMain$@3935e9a8)
> - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
> $outer, type: interface TraitWithMethod)
> - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
> )
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
> ... 18 more
> {code}
> On Github can be found 2 commit. 1 initial i add link on it(this one contain 
> sptreaming example). and Last one with batch job example



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20080.
---
Resolution: Fixed

[~hryhoriev.nick] I can't understand what you're reporting here, so it's 
not nearly suitable as a JIRA. Please read 
http://spark.apache.org/contributing.html
This is not a place for technical support, but for describing clear issues or 
improvements, along with a specific change if possible. If you significantly 
improve the description here, I will reopen it. Do not reopen this issue on 
your own.

> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
>Reporter: Nick Hryhoriev
>Priority: Minor
>
> When i try use or init org.slf4j.Logger inside foreachPartition, that 
> extracted to trait method. 
> What was called in foreachRDD. 
> I have found that foreachPartition method do not execute and no exception 
> appeared.
> Tested on local and yarn mode spark.
> code can be found on 
> [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
>  There are two main class that explain problem.
> if i will run same code with batch job. I will get right exception.
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
> at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
> at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
> at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
> at ReproduceBugMain.main(ReproduceBugMain.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
> - object not serializable (class: ReproduceBugMain$, value: 
> ReproduceBugMain$@3935e9a8)
> - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
> $outer, type: interface TraitWithMethod)
> - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
> )
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
> ... 18 more
> {code}
> On Github can be found 2 commit. 1 initial i add link on it(this one contain 
> sptreaming example). and Last one with batch job example



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code

2017-03-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940078#comment-15940078
 ] 

Takeshi Yamamuro commented on SPARK-13644:
--

Could we close this? I think this issue has already been resolved?

> Add the source file name and line into Logger when an exception occurs in the 
> generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> This is to show a message points out the origin of a generated method when an 
> exception occurs in the generated method at runtime.
> An example of a message (the first line is newly added)
> {code}
> 07:49:29.525 ERROR 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: 
> The method GeneratedIterator.processNext() is generated for filter at 
> Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 2.0 (TID 4)
> java.lang.NullPointerException:
> at ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code

2017-03-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940078#comment-15940078
 ] 

Takeshi Yamamuro edited comment on SPARK-13644 at 3/24/17 9:51 AM:
---

Could we close this? I think this issue has already been resolved?


was (Author: maropu):
Could we close this? I think this issue has already resolved?

> Add the source file name and line into Logger when an exception occurs in the 
> generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> This is to show a message points out the origin of a generated method when an 
> exception occurs in the generated method at runtime.
> An example of a message (the first line is newly added)
> {code}
> 07:49:29.525 ERROR 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: 
> The method GeneratedIterator.processNext() is generated for filter at 
> Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 2.0 (TID 4)
> java.lang.NullPointerException:
> at ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10

2017-03-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940076#comment-15940076
 ] 

Takeshi Yamamuro commented on SPARK-10719:
--

I added a link to SPARK-19810; we could close this once SPARK-19810 is resolved.
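
In the meantime, a possible workaround (untested sketch, reusing {{sc}} and {{sqlContext}} from the snippet quoted below) is to skip the implicit TypeTag-based {{toDF}} path and build the DataFrame from an explicit schema:
{code:language=scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// An explicit schema avoids Scala runtime reflection, which is the
// non-thread-safe part on Scala 2.10.
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
(1 to 1000).par.foreach { _ =>
  val rows = sc.parallelize(1 to 5).map { i => Row(i, i) }
  sqlContext.createDataFrame(rows, schema).count()
}
{code}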

> SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
> --
>
> Key: SPARK-10719
> URL: https://issues.apache.org/jira/browse/SPARK-10719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.0
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>
> Sometimes the following codes failed
> {code}
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> (1 to 1000).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}
> The stack trace is 
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: tail of 
> empty list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
>   at 
> scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
>   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
>   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
>   at 
> scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
>   at 
> scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
>   at 
> scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
>   at 
> scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Finally, I found the problem. The codes generated by Scala compiler to find 
> the implicit TypeTag are not thread safe because of an issue in Scala 2.10: 
> https://issues.scala-lang.org/browse/SI-6240
> This issue was fixed in Scala 2.11 but not backported to 2.10.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2017-03-24 Thread Nick Hryhoriev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Summary: Spak streaming application do not throw serialisation exception in 
foreachRDD  (was: Spak streaming up do not throw serialisation exception in 
foreachRDD)

> Spak streaming application do not throw serialisation exception in foreachRDD
> -
>
> Key: SPARK-20080
> URL: https://issues.apache.org/jira/browse/SPARK-20080
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: local spark and yarn from big top 1.1.0 version
>Reporter: Nick Hryhoriev
>Priority: Minor
>
> When i try use or init org.slf4j.Logger inside foreachPartition, that 
> extracted to trait method. 
> What was called in foreachRDD. 
> I have found that foreachPartition method do not execute and no exception 
> appeared.
> Tested on local and yarn mode spark.
> code can be found on 
> [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
>  There are two main class that explain problem.
> if i will run same code with batch job. I will get right exception.
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
> at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
> at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
> at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
> at ReproduceBugMain.main(ReproduceBugMain.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.io.NotSerializableException: ReproduceBugMain$
> Serialization stack:
> - object not serializable (class: ReproduceBugMain$, value: 
> ReproduceBugMain$@3935e9a8)
> - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
> $outer, type: interface TraitWithMethod)
> - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
> )
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
> ... 18 more
> {code}
> On Github can be found 2 commit. 1 initial i add link on it(this one contain 
> sptreaming example). and Last one with batch job example



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


