[jira] [Created] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true

2015-09-01 Thread Vinod KC (JIRA)
Vinod KC created SPARK-10414:


 Summary: DenseMatrix gives different hashcode even though equals 
returns true
 Key: SPARK-10414
 URL: https://issues.apache.org/jira/browse/SPARK-10414
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Vinod KC
Priority: Minor


The hashCode implementation in DenseMatrix gives different results for the same input:

import org.apache.spark.mllib.linalg.Matrices

val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm1 === dm) // passes (=== from ScalaTest, as used in Spark's test suites)
assert(dm1.hashCode === dm.hashCode) // fails

This violates the hashCode/equals contract.
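Below is a minimal sketch (not the actual Spark fix) of a hashCode that stays consistent with equals by hashing exactly the fields equals compares; the class name and layout are placeholders:

{code}
import java.util.Arrays

// Placeholder standing in for a dense matrix; not Spark's DenseMatrix.
class DenseMatrixLike(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: DenseMatrixLike =>
      numRows == m.numRows && numCols == m.numCols && Arrays.equals(values, m.values)
    case _ => false
  }
  // Hash the same fields equals compares, so equal matrices always share a hash code.
  override def hashCode: Int = 31 * (31 * numRows + numCols) + Arrays.hashCode(values)
}
{code}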






[jira] [Assigned] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9718:
---

Assignee: Apache Spark

> LinearRegressionTrainingSummary should hold all columns in transformed data
> ---
>
> Key: SPARK-9718
> URL: https://issues.apache.org/jira/browse/SPARK-9718
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> LinearRegression training summary: The transformed dataset should hold all 
> columns, not just selected ones like prediction and label.  There is no real 
> need to remove some, and the user may find them useful.






[jira] [Commented] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726845#comment-14726845
 ] 

Apache Spark commented on SPARK-9718:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8564

> LinearRegressionTrainingSummary should hold all columns in transformed data
> ---
>
> Key: SPARK-9718
> URL: https://issues.apache.org/jira/browse/SPARK-9718
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> LinearRegression training summary: The transformed dataset should hold all 
> columns, not just selected ones like prediction and label.  There is no real 
> need to remove some, and the user may find them useful.






[jira] [Assigned] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9718:
---

Assignee: (was: Apache Spark)

> LinearRegressionTrainingSummary should hold all columns in transformed data
> ---
>
> Key: SPARK-9718
> URL: https://issues.apache.org/jira/browse/SPARK-9718
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> LinearRegression training summary: The transformed dataset should hold all 
> columns, not just selected ones like prediction and label.  There is no real 
> need to remove some, and the user may find them useful.






[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-09-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726844#comment-14726844
 ] 

holdenk commented on SPARK-9722:


I can do this if no one else is working on it :)

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.
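For reference, a rough sketch of what a seed param on the spark.ml tree estimators could look like; the trait and names below are placeholders, not the actual spark.ml shared-param code:

{code}
import org.apache.spark.ml.param.{LongParam, Params}

// Hypothetical mix-in exposing a user-settable seed, defaulting to the current fixed value of 1.
trait HasRandomSeed extends Params {
  final val seed: LongParam =
    new LongParam(this, "seed", "random seed used when binning continuous features")
  setDefault(seed, 1L)
  final def getSeed: Long = $(seed)
}
{code}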






[jira] [Updated] (SPARK-10288) Add a rest client for Spark on Yarn

2015-09-01 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-10288:

Description: 
This is a proposal to add a REST client for Spark on YARN. A REST API is a 
convenient addition that lets users submit applications through a REST client, 
making it easy to support long-haul submission and to build custom submission 
gateways on top of it.

Here is the design doc 
(https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).

Currently I'm working on it, working branch is 
(https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
part is already finished.

Any comment is greatly appreciated, thanks a lot.

  was:
This is a proposal to add a REST client for Spark on YARN. Currently the Spark 
standalone and Mesos modes support submitting applications over REST, while 
Spark on YARN still submits programmatically. Since the YARN ResourceManager 
(from Hadoop 2.6) supports submitting applications over REST, it would be good 
for Spark on YARN to support this as well.

Here is the design doc 
(https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).

Currently I'm working on it, working branch is 
(https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
part is already finished.

Any comment is greatly appreciated, thanks a lot.


> Add a rest client for Spark on Yarn
> ---
>
> Key: SPARK-10288
> URL: https://issues.apache.org/jira/browse/SPARK-10288
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> This is a proposal to add a REST client for Spark on YARN. A REST API is a 
> convenient addition that lets users submit applications through a REST client, 
> making it easy to support long-haul submission and to build custom submission 
> gateways on top of it.
> Here is the design doc 
> (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).
> Currently I'm working on it, working branch is 
> (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
> part is already finished.
> Any comment is greatly appreciated, thanks a lot.






[jira] [Resolved] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until page loaded to avoid freezing the page

2015-09-01 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-8694.
---
Resolution: Won't Fix

> Defer executing drawTaskAssignmentTimeline until page loaded to avoid 
> freezing the page
> 
>
> Key: SPARK-8694
> URL: https://issues.apache.org/jira/browse/SPARK-8694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Kousuke Saruta
>
> When there are many tasks in the stage page (such as when running 
> sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds 
> to render the graph (drawTaskAssignmentTimeline) in my environment. The page 
> is unresponsive until the graph is ready.
> However, since the Event Timeline is hidden by default, we can defer 
> drawTaskAssignmentTimeline until the page has loaded to avoid freezing the 
> page, so that the user can view the page while the Event Timeline renders in 
> the background.
> This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid 
> blocking page loading.






[jira] [Commented] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until page loaded to avoid freezing the page

2015-09-01 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726815#comment-14726815
 ] 

Kousuke Saruta commented on SPARK-8694:
---

Now this issue is addressed by the pagination.

> Defer executing drawTaskAssignmentTimeline until page loaded to avoid 
> freezing the page
> 
>
> Key: SPARK-8694
> URL: https://issues.apache.org/jira/browse/SPARK-8694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Kousuke Saruta
>
> When there are many tasks in the stage page (such as when running 
> sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds 
> to render the graph (drawTaskAssignmentTimeline) in my environment. The page 
> is unresponsive until the graph is ready.
> However, since the Event Timeline is hidden by default, we can defer 
> drawTaskAssignmentTimeline until the page has loaded to avoid freezing the 
> page, so that the user can view the page while the Event Timeline renders in 
> the background.
> This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid 
> blocking page loading.






[jira] [Updated] (SPARK-8402) DP means clustering

2015-09-01 Thread Meethu Mathew (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meethu Mathew updated SPARK-8402:
-
Description: 
At present, all the clustering algorithms in MLlib require the number of 
clusters to be specified in advance. 
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
that allows for flexible clustering of data without having to specify apriori 
the number of clusters. 
DP means is a non-parametric clustering algorithm that uses a scale parameter 
'lambda' to control the creation of new clusters ["Revisiting k-means: New 
Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].

We have followed the distributed implementation of DP means which has been 
proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
by Xinghao Pan, Evan R. Sparks, Andre Wibisono.

A benchmark comparison between k-means and DP-means, based on the Normalized 
Mutual Information (NMI) between ground-truth clusters and algorithm outputs, is 
provided in the following table. It can be seen from the table that DP-means 
reported a higher NMI than k-means on 5 of the 8 data sets [Source: Kulis, B., 
Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics 
(2011) Arxiv:.0352. (Table 1)]

| Dataset   | DP-means | k-means |
| Wine  | .41  | .43 |
| Iris  | .75  | .76 |
| Pima  | .02  | .03 |
| Soybean   | .72  | .66 |
| Car   | .07  | .05 |
| Balance Scale | .17  | .11 |
| Breast Cancer | .04  | .03 |
| Vehicle   | .18  | .18 |

Experiment on our Spark cluster setup:

An initial benchmark study was performed on a 3-node Spark cluster set up on 
Mesos, where each node had 8 cores and 64 GB RAM; the Spark version used was 
1.5 (git branch).

Tests were done using a mixture of 10 Gaussians with varying numbers of features 
and instances. The results from the benchmark study are provided below. The 
reported stats are averages over 5 runs.

| Instances   | Dimensions | No. of clusters obtained | DP-means time | DP-means iterations to converge | k-means (k=10) time | k-means iterations to converge |
| 10 million  | 10         | 10                       | 43.6s         | 2                               | 52.2s               | 2                              |
| 1 million   | 100        | 10                       | 39.8s         | 2                               | 43.39s              | 2                              |
| 0.1 million | 1000       | 10                       | 37.3s         | 2                               | 41.64s              | 2                              |

  was:
At present, all the clustering algorithms in MLlib require the number of 
clusters to be specified in advance. 
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
that allows for flexible clustering of data without having to specify apriori 
the number of clusters. 
DP means is a non-parametric clustering algorithm that uses a scale parameter 
'lambda' to control the creation of new clusters["Revisiting k-means: New 
Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].

We have followed the distributed implementation of DP means which has been 
proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
by Xinghao Pan, Evan R. Sparks, Andre Wibisono.


> DP means clustering 
> 
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Meethu Mathew
>Assignee: Meethu Mathew
>  Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify apriori 
> the number of clusters. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and dp-means based on Normalized 
> Mutual Information between ground truth clusters and algorithm outputs, have 
> been provided in the following table. It can be seen from the table that 
> DP-means reported a higher NMI on 5 of 8 data sets in comparison to 
> k-means[Source: Kulis, B., Jordan,

[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn

2015-09-01 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726814#comment-14726814
 ] 

Saisai Shao commented on SPARK-10288:
-

Hi [~vanzin], thanks a lot for your comments. Yes, it doesn't make sense to 
compare YARN with Standalone and Mesos, and the protocol is also different from 
the other two cluster managers. I will update the description. But to some 
extent I think a REST client is still meaningful, as [~ste...@apache.org] 
mentioned.

If you have any suggestions, please let me know. Thanks a lot.

> Add a rest client for Spark on Yarn
> ---
>
> Key: SPARK-10288
> URL: https://issues.apache.org/jira/browse/SPARK-10288
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> This is a proposal to add a REST client for Spark on YARN. Currently the Spark 
> standalone and Mesos modes support submitting applications over REST, while 
> Spark on YARN still submits programmatically. Since the YARN ResourceManager 
> (from Hadoop 2.6) supports submitting applications over REST, it would be good 
> for Spark on YARN to support this as well.
> Here is the design doc 
> (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).
> Currently I'm working on it, working branch is 
> (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
> part is already finished.
> Any comment is greatly appreciated, thanks a lot.






[jira] [Commented] (SPARK-8469) Application timeline view unreadable with many executors

2015-09-01 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726812#comment-14726812
 ] 

Kousuke Saruta commented on SPARK-8469:
---

Thanks for investigating the use case of the timeline view with dynamic 
allocation.
I understand that showing only the last N is not meaningful.
Unfortunately, I don't have enough time to work out a better solution before 
the end of October.
I'll try to address this issue after that.

> Application timeline view unreadable with many executors
> 
>
> Key: SPARK-8469
> URL: https://issues.apache.org/jira/browse/SPARK-8469
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Kousuke Saruta
> Attachments: Screen Shot 2015-06-18 at 5.51.21 PM.png
>
>
> This is a problem when using dynamic allocation with many executors. See the 
> attached screenshot. We may want to limit the number of stacked events somehow.






[jira] [Commented] (SPARK-9717) Add persistence to MulticlassMetrics

2015-09-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726803#comment-14726803
 ] 

holdenk commented on SPARK-9717:


I was looking at this, but it seems like it doesn't make as much sense since 
there isn't an internal RDD.

> Add persistence to MulticlassMetrics
> 
>
> Key: SPARK-9717
> URL: https://issues.apache.org/jira/browse/SPARK-9717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add RDD persistence to MulticlassMetrics internals, following the example of 
> BinaryClassificationMetrics.






[jira] [Commented] (SPARK-3871) compute-classpath.sh does not escape :

2015-09-01 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726785#comment-14726785
 ] 

Iulian Dragos commented on SPARK-3871:
--

There's no more compute-classpath.sh. Ok to close this?

> compute-classpath.sh does not escape :
> --
>
> Key: SPARK-3871
> URL: https://issues.apache.org/jira/browse/SPARK-3871
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.1.0
>Reporter: Hector Yee
>Priority: Minor
>
> Chronos jobs on Mesos schedule jobs in temp directories such as
> /tmp/mesos/slaves/20140926-142803-3852091146-5050-3487-375/frameworks/20140719-203536-160311562-5050-10655-0007/executors/ct:1412815902180:2:search_ranking_scoring/runs/f1e0d058-3ef0-4838-816e-e3fa5e179dd8
> The compute-classpath.sh does not properly escape the : in the temp dirs 
> generated by Mesos, so spark-submit gets a broken classpath.






[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-09-01 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726784#comment-14726784
 ] 

Iulian Dragos commented on SPARK-4940:
--

Would it make sense to allocate resources in a round-robin fashion? Supposing 
Spark gets several offers at the same time, it would have enough info to 
balance executors across the available resources (or, optionally, define an 
interval during which it holds on to the resources it receives, to accumulate a 
larger set of slaves).

The algorithm could proceed by allocating a multiple of `spark.task.cores` 
(below the cap; see SPARK-9873, which might help on its own) on each slave in 
the set of resources, until it can't allocate any more.
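For what it's worth, a rough sketch of that round-robin idea in plain Scala (illustrative only, not the actual Mesos scheduler code; `taskCores` stands in for `spark.task.cores` and `coresCap` for an overall cap such as `spark.cores.max`):

{code}
// Take one task's worth of cores from each offer per pass, cycling over the slaves
// until the cap is reached or no offer can fit another task.
def roundRobinAllocate(offers: Map[String, Double],  // slaveId -> free cores in the offer
                       taskCores: Double,
                       coresCap: Double): Map[String, Double] = {
  val remaining = scala.collection.mutable.Map(offers.toSeq: _*)
  val allocated = scala.collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
  var total = 0.0
  var progress = true
  while (progress && total + taskCores <= coresCap) {
    progress = false
    for (slave <- remaining.keys.toSeq
         if total + taskCores <= coresCap && remaining(slave) >= taskCores) {
      remaining(slave) -= taskCores
      allocated(slave) += taskCores
      total += taskCores
      progress = true
    }
  }
  allocated.toMap
}
{code}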

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution based on 
> the resources available on each slave.






[jira] [Assigned] (SPARK-8514) LU factorization on BlockMatrix

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8514:
---

Assignee: (was: Apache Spark)

> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala
>
>
> LU is the most common method for solving a general linear system or inverting 
> a general matrix. A distributed version could be implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html
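(Not part of the original description.) As a local point of reference, a minimal single-machine sketch of the factorization itself (Doolittle LU, no pivoting); the distributed version proposed here would apply the same recurrences block-wise with pipelining:

{code}
// Doolittle LU without pivoting: A = L * U, with a unit diagonal on L. Local illustration only.
def luDecompose(a: Array[Array[Double]]): (Array[Array[Double]], Array[Array[Double]]) = {
  val n = a.length
  val l = Array.tabulate(n, n)((i, j) => if (i == j) 1.0 else 0.0)
  val u = Array.fill(n, n)(0.0)
  for (i <- 0 until n) {
    for (j <- i until n)      // row i of U
      u(i)(j) = a(i)(j) - (0 until i).map(k => l(i)(k) * u(k)(j)).sum
    for (j <- i + 1 until n)  // column i of L
      l(j)(i) = (a(j)(i) - (0 until i).map(k => l(j)(k) * u(k)(i)).sum) / u(i)(i)
  }
  (l, u)
}
{code}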






[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726777#comment-14726777
 ] 

Apache Spark commented on SPARK-8514:
-

User 'nilmeier' has created a pull request for this issue:
https://github.com/apache/spark/pull/8563

> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala
>
>
> LU is the most common method for solving a general linear system or inverting 
> a general matrix. A distributed version could be implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html






[jira] [Assigned] (SPARK-8514) LU factorization on BlockMatrix

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8514:
---

Assignee: Apache Spark

> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala
>
>
> LU is the most common method for solving a general linear system or inverting 
> a general matrix. A distributed version could be implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html






[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-01 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726775#comment-14726775
 ] 

Iulian Dragos commented on SPARK-7874:
--

[~tomdz] do you mean respecting `spark.cores.max`, as it is the case in 
coarse-grained mode?

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.






[jira] [Updated] (SPARK-8514) LU factorization on BlockMatrix

2015-09-01 Thread Jerome (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerome updated SPARK-8514:
--
Attachment: Matrix Factorization - M...ark 1.5.0 Documentation.pdf

I added a version of the Documentation that contains some of the design 
documentation for the LU algorithm.  Some of the descriptions may not be 
necessary for Spark users, but could be useful for reviewers.  Cheers, Jerome

> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala
>
>
> LU is the most common method for solving a general linear system or inverting 
> a general matrix. A distributed version could be implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html






[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released to other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of Python API is to have feature parity with Scala/Java API. You 
can find a complete list 
[here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall 
into two major categories:

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPAR

[jira] [Created] (SPARK-10413) Model should support prediction on single instance

2015-09-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10413:
-

 Summary: Model should support prediction on single instance
 Key: SPARK-10413
 URL: https://issues.apache.org/jira/browse/SPARK-10413
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Xiangrui Meng
Priority: Critical


Currently, models in the pipeline API only implement transform(DataFrame). It 
would be quite useful to support prediction on a single instance.
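For context, a rough sketch of the workaround users need today: wrap the single instance in a one-row DataFrame and run the full transform() machinery. `model` is assumed to be an already-fitted spark.ml model and `sqlContext` an existing SQLContext; the ticket asks for a direct predict(features)-style API instead.

{code}
import org.apache.spark.mllib.linalg.Vectors

// Build a one-row DataFrame just to score a single feature vector.
val single = sqlContext.createDataFrame(Seq(
  (1L, Vectors.dense(0.5, 1.2))
)).toDF("id", "features")

val prediction = model.transform(single).select("prediction").first().getDouble(0)
{code}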






[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration

2015-09-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726751#comment-14726751
 ] 

holdenk commented on SPARK-9595:


I can do this if no one else is working on it.

> Adding API to SparkConf for kryo serializers registration
> -
>
> Key: SPARK-9595
> URL: https://issues.apache.org/jira/browse/SPARK-9595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1
>Reporter: John Chen
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently SparkConf has a registerKryoClasses API for Kryo registration. 
> However, this only works when you register classes. If you want to register 
> customized Kryo serializers, you'll have to extend the KryoSerializer class 
> and write some code.
> This is not only inconvenient, but also requires the registration to be done 
> at compile time, which is not always possible. Thus, I suggest adding another 
> API to SparkConf for registering customized Kryo serializers. It could look 
> like this:
> def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf
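For context, a hedged sketch of what registering a custom Kryo serializer looks like today via a KryoRegistrator (an existing Spark API), next to the proposed call (which does not exist yet); MyEvent, MyEventSerializer, and MyRegistrator are placeholders:

{code}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyEvent(id: Long)  // placeholder user class

// Placeholder custom serializer for MyEvent.
class MyEventSerializer extends Serializer[MyEvent] {
  override def write(kryo: Kryo, out: Output, e: MyEvent): Unit = out.writeLong(e.id)
  override def read(kryo: Kryo, in: Input, cls: Class[MyEvent]): MyEvent = MyEvent(in.readLong())
}

// Today's route: a registrator class wired up by name at configuration time.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyEvent], new MyEventSerializer)
}

def buildConf(): SparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)

// Proposed route from this ticket (not an existing API):
// conf.registerKryoSerializers(Map(classOf[MyEvent] -> new MyEventSerializer))
{code}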






[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-09-01 Thread Nick Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726736#comment-14726736
 ] 

Nick Xie commented on SPARK-3655:
-

I did exactly that. Since I will always provide a comparator, I also took the 
liberty of removing a few overloaded constructors. Less is more when it comes 
to code maintenance.

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.
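For reference, a hedged sketch of the common workaround on the current API: fold the secondary field into a composite key, partition on the primary part only, and let repartitionAndSortWithinPartitions (an existing RDD method) order records within each partition:

{code}
import org.apache.spark.{Partitioner, SparkContext}

// Partition only on the primary part of the composite (primary, secondary) key.
class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (primary, _) => ((primary.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def secondarySortExample(sc: SparkContext): Unit = {
  val data = sc.parallelize(Seq(("a", 3, "x"), ("a", 1, "y"), ("b", 2, "z")))
  val keyed = data.map { case (k, order, v) => ((k, order), v) }  // composite key
  // Sorting uses the composite key's ordering, so records with the same primary key
  // come out of each partition ordered by the secondary field.
  val sorted = keyed.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(2))
  sorted.foreach(println)
}
{code}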






[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-09-01 Thread Koert Kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726664#comment-14726664
 ] 

Koert Kuipers commented on SPARK-3655:
--

Did you build a version that does not use Optional for the Java API?


> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.






[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-09-01 Thread Nick Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726658#comment-14726658
 ] 

Nick Xie commented on SPARK-3655:
-

Thanks for the quick changes to get rid of the Ordering dependency. Since I am 
only using it in a specific way, through a few small hacks I was able to get 
rid of the entire runtime dependency on Guava.

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.






[jira] [Updated] (SPARK-10410) spark 1.4.1 kill command does not work with streaming job.

2015-09-01 Thread Bryce Ageno (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryce Ageno updated SPARK-10410:

Shepherd:   (was: Bryce Ageno)

> spark 1.4.1 kill command does not work with streaming job.
> --
>
> Key: SPARK-10410
> URL: https://issues.apache.org/jira/browse/SPARK-10410
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.4.1
>Reporter: Bryce Ageno
>
> Our team recently upgraded a cluster from 1.3.1 to 1.4.1 and we discovered 
> that when you run the kill command for a driver (/usr/spark/bin/spark-submit 
> --master spark://$SPARK_MASTER_IP:6066 --kill $SPARK_DRIVER), it does not 
> remove the driver from the Spark UI. It is a streaming job, and the kill 
> command "ends" the job but does not free up the resources or remove it from 
> the Spark master.
> We are running in cluster mode. We have also noticed that with 1.4.1, across 
> multiple spark-submits, all of the drivers end up on a single worker.






[jira] [Updated] (SPARK-10034) add regression test for Sort on Aggregate

2015-09-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10034:

Description: 
Before #8371, there was a bug with `Sort` on `Aggregate`: we couldn't use 
aggregate expressions named `_aggOrdering`, and couldn't use more than one 
ordering expression containing aggregate functions. The reason for this bug is 
that the aggregate expression in `SortOrder` never gets resolved; we alias it 
with `_aggOrdering` and call `toAttribute`, which gives us an 
`UnresolvedAttribute`. So we are actually referencing the aggregate expression 
by name, not by exprId as we thought. And if there is already an aggregate 
expression named `_aggOrdering`, or there is more than one ordering expression 
containing aggregate functions, we get conflicting names and can't look the 
expression up by name.

However, after #8371 was merged, the `SortOrder`s are guaranteed to be resolved 
and we always reference the aggregate expression by exprId. The bug no longer 
exists, and this PR adds regression tests for it.

  was:
{code}
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
  .agg(max('j).as("_aggOrdering"))
  .orderBy(sum('j))
checkAnswer(query, Row(1, 2))
{code}


> add regression test for Sort on Aggregate
> -
>
> Key: SPARK-10034
> URL: https://issues.apache.org/jira/browse/SPARK-10034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> Before #8371, there was a bug with `Sort` on `Aggregate`: we couldn't use 
> aggregate expressions named `_aggOrdering`, and couldn't use more than one 
> ordering expression containing aggregate functions. The reason for this bug is 
> that the aggregate expression in `SortOrder` never gets resolved; we alias it 
> with `_aggOrdering` and call `toAttribute`, which gives us an 
> `UnresolvedAttribute`. So we are actually referencing the aggregate expression 
> by name, not by exprId as we thought. And if there is already an aggregate 
> expression named `_aggOrdering`, or there is more than one ordering expression 
> containing aggregate functions, we get conflicting names and can't look the 
> expression up by name.
> However, after #8371 was merged, the `SortOrder`s are guaranteed to be 
> resolved and we always reference the aggregate expression by exprId. The bug 
> no longer exists, and this PR adds regression tests for it.






[jira] [Updated] (SPARK-10034) add regression test for Sort on Aggregate

2015-09-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10034:

Summary: add regression test for Sort on Aggregate  (was: add regression 
test for sort on )

> add regression test for Sort on Aggregate
> -
>
> Key: SPARK-10034
> URL: https://issues.apache.org/jira/browse/SPARK-10034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> {code}
> val df = Seq(1 -> 2).toDF("i", "j")
> val query = df.groupBy('i)
>   .agg(max('j).as("_aggOrdering"))
>   .orderBy(sum('j))
> checkAnswer(query, Row(1, 2))
> {code}






[jira] [Updated] (SPARK-10034) add regression test for sort on

2015-09-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10034:

Summary: add regression test for sort on   (was: Can't analyze Sort on 
Aggregate with aggregation expression named "_aggOrdering")

> add regression test for sort on 
> 
>
> Key: SPARK-10034
> URL: https://issues.apache.org/jira/browse/SPARK-10034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> {code}
> val df = Seq(1 -> 2).toDF("i", "j")
> val query = df.groupBy('i)
>   .agg(max('j).as("_aggOrdering"))
>   .orderBy(sum('j))
> checkAnswer(query, Row(1, 2))
> {code}






[jira] [Created] (SPARK-10412) In SQL tab, show execution memory per physical operator

2015-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10412:
-

 Summary: In SQL tab, show execution memory per physical operator
 Key: SPARK-10412
 URL: https://issues.apache.org/jira/browse/SPARK-10412
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or


We already display it per task / stage. It's really useful to also display it 
per operator so the user can know which one caused all the memory to be 
allocated.






[jira] [Created] (SPARK-10411) In SQL tab move visualization above explain output

2015-09-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10411:
-

 Summary: In SQL tab move visualization above explain output
 Key: SPARK-10411
 URL: https://issues.apache.org/jira/browse/SPARK-10411
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Shixiong Zhu


Request from [~pwendell]:

(1) The visualization is much more interesting than the DF explain output. That 
should be at the top of the page.

(2) The DF explain output is for advanced users and should be collapsed by 
default

These are just minor UX optimizations.






[jira] [Commented] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size

2015-09-01 Thread Xiaoyu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726557#comment-14726557
 ] 

Xiaoyu Wang commented on SPARK-10314:
-

I resubmitted the pull request on the master branch.

> [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
> when parallelism is bigger than data split size
> 
>
> Key: SPARK-10314
> URL: https://issues.apache.org/jira/browse/SPARK-10314
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.4.1
> Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
>Reporter: Xiaoyu Wang
>Priority: Minor
>
> RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when 
> parallelism is bigger than data split size
> {code}
> val rdd = sc.parallelize(List(1, 2),2)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> is ok.
> {code}
> val rdd = sc.parallelize(List(1, 2),3)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> got exception:
> {noformat}
> 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24
> 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 
> output partitions (allowLocal=false)
> 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> :24)
> 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at :21), which has no 
> missing parents
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
> curMem=0, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1096.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
> curMem=1096, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 788.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43776 (size: 788.0 B, free: 706.9 MB)
> 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:874
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21)
> 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1269 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
> 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
> 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
> /mnt/tachyon_default_home as the default value.
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
> master @ localhost/127.0.0.1:19998
> 15/08/27 17:53:08 INFO : User registered at the master 
> localhost/127.0.0.1:19998 got UserId 109
> 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
> /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
> 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
> 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
> 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
> created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
> was created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
> was created!
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
> on localhost:43776 (size: 0.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManager: 

[jira] [Comment Edited] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size

2015-09-01 Thread Xiaoyu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726557#comment-14726557
 ] 

Xiaoyu Wang edited comment on SPARK-10314 at 9/2/15 1:33 AM:
-

I resubmitted the pull request on the master branch:
https://github.com/apache/spark/pull/8562


was (Author: wangxiaoyu):
I resubmit the pull request on the master branch

> [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
> when parallelism is bigger than data split size
> 
>
> Key: SPARK-10314
> URL: https://issues.apache.org/jira/browse/SPARK-10314
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.4.1
> Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
>Reporter: Xiaoyu Wang
>Priority: Minor
>
> RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when 
> parallelism is bigger than data split size
> {code}
> val rdd = sc.parallelize(List(1, 2),2)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> is ok.
> {code}
> val rdd = sc.parallelize(List(1, 2),3)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> got exception:
> {noformat}
> 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24
> 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 
> output partitions (allowLocal=false)
> 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> :24)
> 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at :21), which has no 
> missing parents
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
> curMem=0, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1096.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
> curMem=1096, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 788.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43776 (size: 788.0 B, free: 706.9 MB)
> 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:874
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21)
> 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1269 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
> 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
> 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
> /mnt/tachyon_default_home as the default value.
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
> master @ localhost/127.0.0.1:19998
> 15/08/27 17:53:08 INFO : User registered at the master 
> localhost/127.0.0.1:19998 got UserId 109
> 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
> /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
> 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
> 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
> 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
> created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
> was created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
> was created!
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
> on localhost:43776 (size: 0.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
> on localhost:43776 (s

[jira] [Updated] (SPARK-4122) Add library to write data back to Kafka

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4122:
-
Target Version/s: 1.6.0

> Add library to write data back to Kafka
> ---
>
> Key: SPARK-4122
> URL: https://issues.apache.org/jira/browse/SPARK-4122
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726551#comment-14726551
 ] 

Apache Spark commented on SPARK-10314:
--

User 'romansew' has created a pull request for this issue:
https://github.com/apache/spark/pull/8562

> [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
> when parallelism is bigger than data split size
> 
>
> Key: SPARK-10314
> URL: https://issues.apache.org/jira/browse/SPARK-10314
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.4.1
> Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
>Reporter: Xiaoyu Wang
>Priority: Minor
>
> RDD persist to OFF_HEAP Tachyon gets a "block rdd_x_x not found" exception when 
> parallelism is bigger than the data split size.
> {code}
> val rdd = sc.parallelize(List(1, 2),2)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> is ok.
> {code}
> val rdd = sc.parallelize(List(1, 2),3)
> rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
> rdd.count()
> {code}
> got exception:
> {noformat}
> 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24
> 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 
> output partitions (allowLocal=false)
> 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> :24)
> 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at :21), which has no 
> missing parents
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
> curMem=0, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1096.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
> curMem=1096, maxMem=741196431
> 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 788.0 B, free 706.9 MB)
> 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43776 (size: 788.0 B, free: 706.9 MB)
> 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:874
> 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21)
> 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1269 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
> localhost, PROCESS_LOCAL, 1270 bytes)
> 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
> 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
> 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
> 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
> 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
> /mnt/tachyon_default_home as the default value.
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
> master @ localhost/127.0.0.1:19998
> 15/08/27 17:53:08 INFO : User registered at the master 
> localhost/127.0.0.1:19998 got UserId 109
> 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
> /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
> 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
> 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
> 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
> created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
> was created!
> 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
> was created!
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
> on localhost:43776 (size: 0.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
> on localhost:43776 (size: 2.0 B)
> 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore 
> on localhost:43776

[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3586:
-
Target Version/s: 1.6.0

> Support nested directories in Spark Streaming
> -
>
> Key: SPARK-3586
> URL: https://issues.apache.org/jira/browse/SPARK-3586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Minor
>
> For text files, the method streamingContext.textFileStream(dataDirectory) makes 
> Spark Streaming monitor the directory dataDirectory and process any files 
> created in that directory, but files written in nested directories are not 
> supported.
> e.g.
> streamingContext.textFileStream(/test)
> Look at the directory contents:
> /test/file1
> /test/file2
> /test/dr/file1
> With this method, "textFileStream" can only read:
> /test/file1
> /test/file2
> /test/dr/
> but the file /test/dr/file1 is not read.
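
For anyone hitting this before nested-directory support lands, a minimal workaround 
sketch (not from this ticket; the directory names, batch interval, and master are 
illustrative) is to monitor each known subdirectory separately and union the 
resulting streams:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("nested-dirs-workaround").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// textFileStream watches a single directory non-recursively, so list every
// subdirectory explicitly and union the per-directory streams into one DStream.
val dirs = Seq("/test", "/test/dr")
val lines = ssc.union(dirs.map(d => ssc.textFileStream(d)))

lines.print()
ssc.start()
ssc.awaitTermination()
{code}

New subdirectories created at runtime would still be missed, which is why proper 
nested-directory support is being requested here.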



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3586:
-
Labels:   (was: patch)

> Support nested directories in Spark Streaming
> -
>
> Key: SPARK-3586
> URL: https://issues.apache.org/jira/browse/SPARK-3586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Minor
>
> For text files, the method streamingContext.textFileStream(dataDirectory) makes 
> Spark Streaming monitor the directory dataDirectory and process any files 
> created in that directory, but files written in nested directories are not 
> supported.
> e.g.
> streamingContext.textFileStream(/test)
> Look at the directory contents:
> /test/file1
> /test/file2
> /test/dr/file1
> With this method, "textFileStream" can only read:
> /test/file1
> /test/file2
> /test/dr/
> but the file /test/dr/file1 is not read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala)


was (Author: avulanov):
Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp (

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp (


was (Author: avulanov):
Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM:
---

Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp 
(https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)


was (Author: avulanov):
Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1]. real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1]. real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10410) spark 1.4.1 kill command does not work with streaming job.

2015-09-01 Thread Bryce Ageno (JIRA)
Bryce Ageno created SPARK-10410:
---

 Summary: spark 1.4.1 kill command does not work with streaming job.
 Key: SPARK-10410
 URL: https://issues.apache.org/jira/browse/SPARK-10410
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.1
Reporter: Bryce Ageno


Our team recently upgraded a cluster from 1.3.1 to 1.4.1 and discovered that 
when you run the kill command for a driver (/usr/spark/bin/spark-submit 
--master spark://$SPARK_MASTER_IP:6066 --kill $SPARK_DRIVER) it does not remove 
the driver from the Spark UI.  It is a streaming job, and the kill command 
"ends" the job but does not free up the resources or remove it from the 
Spark master.

We are running in cluster mode.  We have also noticed that with 1.4.1, across 
multiple spark-submits, all of the drivers end up on a single worker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10409:


 Summary: Multilayer perceptron regression
 Key: SPARK-10409
 URL: https://issues.apache.org/jira/browse/SPARK-10409
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Implement regression based on the multilayer perceptron (MLP). It should support 
different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. The 
implementation might take advantage of an autoencoder. Time-series forecasting for 
financial data might be one of the use cases (see 
http://dl.acm.org/citation.cfm?id=561452), so more specific requirements from 
this (or another) area are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726435#comment-14726435
 ] 

Alexander Ulanov commented on SPARK-10409:
--

Basic implementation with the current ML api can be found here: 
https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on the multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. 
> The implementation might take advantage of an autoencoder. Time-series 
> forecasting for financial data might be one of the use cases (see 
> http://dl.acm.org/citation.cfm?id=561452), so more specific requirements from 
> this (or another) area are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Added implementation for (1) that is basic deep autoencoder 
https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Issue Type: Umbrella  (was: Improvement)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10408:


 Summary: Autoencoder
 Key: SPARK-10408
 URL: https://issues.apache.org/jira/browse/SPARK-10408
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1]. real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10387) Code generation for decision tree

2015-09-01 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726392#comment-14726392
 ] 

DB Tsai commented on SPARK-10387:
-

Here are our current research results.

We implemented a prototype of code generation for trees; the implementation of 
the code-gen is here:
https://github.com/dbtsai/tree/blob/master/macros/src/main/scala/Tree.scala

1) We found that code-gen is 4x to 6x faster than a naive binary tree when the 
number of trees used in GBDT is small, but with around 500x trees the 
performance is slightly worse. 

2) We're also benchmarking the flattened-trees idea described here: 
http://tullo.ch/articles/decision-tree-evaluation/

3) Finally, QuickScorer (A Fast Algorithm to Rank Documents with Additive 
Ensembles of Regression Trees, 
http://delivery.acm.org/10.1145/277/2767733/p73-lucchese.pdf) is being 
implemented; we will benchmark it as well. 
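
For readers following along, a minimal, self-contained sketch of what "code 
generation for a tree" means here (illustrative only, not the prototype linked 
above, which uses Scala macros): a small tree is turned into nested if/else 
source text that can be compiled once and reused for every row.

{code}
// Hypothetical node types for illustration only.
sealed trait Node
case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node
case class Leaf(prediction: Double) extends Node

// Emit nested if/else source for one tree: evaluation becomes a straight
// comparison chain instead of pointer-chasing through node objects.
def genCode(node: Node): String = node match {
  case Leaf(p)           => p.toString
  case Split(f, t, l, r) => s"if (features($f) <= $t) { ${genCode(l)} } else { ${genCode(r)} }"
}

val tree   = Split(0, 1.5, Leaf(0.0), Split(1, 2.5, Leaf(1.0), Leaf(2.0)))
val source = s"(features: Array[Double]) => ${genCode(tree)}"
// `source` can be compiled once (e.g. with scala.tools.reflect.ToolBox) and the
// resulting Array[Double] => Double reused for every row, which is where the win comes from.
println(source)
{code}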

> Code generation for decision tree
> -
>
> Key: SPARK-10387
> URL: https://issues.apache.org/jira/browse/SPARK-10387
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>
> Provide code generation for decision tree and tree ensembles. Let's first 
> discuss the design and then create new JIRAs for tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8534) Gini for regression metrics and evaluator

2015-09-01 Thread Ehsan Mohyedin Kermani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726364#comment-14726364
 ] 

Ehsan Mohyedin Kermani commented on SPARK-8534:
---

I'd like to give it a shot, but first I think we need a distributed scan 
function for computing the cumulative sum of the sorted predictions. Would it 
be possible to add that to RegressionMetrics or perhaps mllib.util first? An 
implementation was suggested here: 
https://groups.google.com/forum/#!topic/spark-users/ts-FdB50ltY. 
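
For context, a hedged sketch of the kind of distributed scan in question -- a 
cumulative sum over an already-sorted RDD[Double], built from per-partition 
totals. The function name and placement are illustrative, not an existing 
MLlib API:

{code}
import org.apache.spark.rdd.RDD

// Cumulative ("prefix") sum over a sorted RDD: collect each partition's total,
// then add the running offset of all earlier partitions within each partition.
def cumulativeSum(sorted: RDD[Double]): RDD[Double] = {
  val partitionTotals = sorted
    .mapPartitionsWithIndex((i, it) => Iterator((i, it.sum)))
    .collect()
    .sortBy(_._1)
    .map(_._2)
  // offsets(i) = sum of the totals of partitions 0 .. i-1
  val offsets = partitionTotals.scanLeft(0.0)(_ + _)
  sorted.mapPartitionsWithIndex { (i, it) =>
    it.scanLeft(offsets(i))(_ + _).drop(1) // running sum within the partition
  }
}
{code}

Roughly speaking, normalized Gini then reduces to sorting by prediction, taking 
cumulative sums of the labels, and comparing against the same quantity under a 
perfect ordering.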

> Gini for regression metrics and evaluator
> -
>
> Key: SPARK-8534
> URL: https://issues.apache.org/jira/browse/SPARK-8534
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> One common metric we do not have in RegressionMetrics or RegressionEvaluator 
> is Gini: [https://www.kaggle.com/wiki/Gini]
> Implementing (normalized) Gini would be nice.  However, it might be 
> expensive; I believe it would require sorting the labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331
 ] 

ashish shenoy edited comment on SPARK-10405 at 9/1/15 10:39 PM:


[~srowen] yes, technically it's a good-to-have, not a must-have. I can think of 
many instances where such an API would be very convenient and useful for users. 

I was using aggregateByKey() with a custom-written bounded priority queue. 
As per the Spark documentation, the func param to foldByKey() should be an 
associative merge function. So I can see how this can be used to get the 
max or min value per key, but not the top or bottom values. Since I am a 
Spark newbie, can you please give an example of how one could use a priority 
queue with foldByKey()? 

Also, the default PriorityQueue implementation in java.util is unbounded; could 
this cause OOM exceptions if the cardinality of the key set is very large? 



was (Author: ashishen...@gmail.com):
[~srowen] yes, technically its a good to have not a must have. I could think of 
many instances where such an API would be very convenient and useful for users 
to have. 

I was using the aggregateByKey() with a custom written bounded priority queue. 
As per the spark documentation, the func param to foldByKey() should be an 
associative merge function. So I can think of how this can be used to get the 
max or min value per key, but not the top or bottom values. Since I am a 
spark-noob, can you pls give an example of how one could use a priorityQueue 
with foldByKey() ? 

Also, the default PriorityQueue implementation in java.util is unbounded; could 
this cause OOM exceptions if the cardinality of the keyset is very large ? 


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331
 ] 

ashish shenoy edited comment on SPARK-10405 at 9/1/15 10:38 PM:


[~srowen] yes, technically its a good to have not a must have. I could think of 
many instances where such an API would be very convenient and useful for users 
to have. 

I was using the aggregateByKey() with a custom written bounded priority queue. 
As per the spark documentation, the func param to foldByKey() should be an 
associative merge function. So I can think of how this can be used to get the 
max or min value per key, but not the top or bottom values. Since I am a 
spark-noob, can you pls give an example of how one could use a priorityQueue 
with foldByKey() ? 

Also, the default PriorityQueue implementation in java.util is unbounded; could 
this cause OOM exceptions if the cardinality of the keyset is very large ? 



was (Author: ashishen...@gmail.com):
[~srowen] yes, technically its a good to have not a must have. I could think of 
many instances where such an API would be very convenient and useful for users 
to have. 

Thanks for that foldByKey() tip; I was using the aggregateByKey() with a custom 
written bounded priority queue. Since I am a spark-noob, can you pls give an 
example of how one could use a priorityQueue with foldByKey() ? Also, the 
default PriorityQueue implementation in java.util is unbounded; could this 
cause OOM exceptions if the cardinality of the keyset is very large ? 


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

2015-09-01 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726339#comment-14726339
 ] 

Kay Ousterhout commented on SPARK-2666:
---

[~irashid] totally agree, and IIRC there's a TODO somewhere in the scheduler 
code suggesting we kill all remaining running tasks once a stage becomes a 
zombie.  

> when task is FetchFailed cancel running tasks of failedStage
> 
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Lianhui Wang
>
> in DAGScheduler's handleTaskCompletion,when reason of failed task is 
> FetchFailed, cancel running tasks of failedStage before add failedStage to 
> failedStages queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331
 ] 

ashish shenoy commented on SPARK-10405:
---

[~srowen] yes, technically it's a good-to-have, not a must-have. I could think of 
many instances where such an API would be very convenient and useful for users 
to have. 

Thanks for that foldByKey() tip; I was using aggregateByKey() with a custom-written 
bounded priority queue. Since I am a Spark newbie, can you please give an 
example of how one could use a priority queue with foldByKey()? Also, the 
default PriorityQueue implementation in java.util is unbounded; could this 
cause OOM exceptions if the cardinality of the key set is very large? 


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10407) Possible Stack-overflow using InheritableThreadLocal nested-properties for SparkContext.localProperties

2015-09-01 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-10407:
---
Description: 
In my long-running web server that eventually uses a SparkContext, I eventually 
came across some stack overflow errors that could only be cleared by restarting 
my server.

{code}
java.lang.StackOverflowError: null
at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) 
~[na:1.7.0_45]
...
...
 at 
org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) 
~[commons-lang3-3.3.jar:3.3]
at 
org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
...
{code}

The bottom of the trace indicates that serializing a properties object is part 
of the stack when the overflow happens. I checked the origin of the properties, 
and it turns out it's coming from SparkContext.localProperties, an 
InheritableThreadLocal field.

When I debugged further, I found that localProperties.childValue() wraps its 
parent properties object in another properties object and returns the wrapper 
properties. The problem is that every time childValue was called, the 
properties object passed in from the parent had a deeper and deeper nesting of 
wrapped properties. This doesn't make any sense since my application doesn't 
create threads recursively or anything like that, so I'm marking this issue as 
a minor one since it shouldn't affect the average application.

On the other hand, there shouldn't really be any reason to create the 
properties in childValue using nesting. Instead, the properties returned by 
childValue should be flattened and, more importantly, a deep copy of the 
parent. I'm also concerned about the parent thread possibly modifying the 
wrapped properties object while it's being used by the child thread, creating 
possible race conditions since Properties is not thread-safe.

  was:
In my long-running web server that eventually uses a SparkContext, I eventually 
came across some stack overflow errors that could only be cleared by restarting 
my server.

{code}
java.lang.StackOverflowError: null
at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) 
~[na:1.7.0_45]
...
...
 at 
org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) 
~[commons-lang3-3.3.jar:3.3]
at 
org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
...
{code}

The bottom of the trace indicates that serializing a properties object is part 
of the stack when the overflow happens. I checked the origin of the properties, 
and it turns 

[jira] [Created] (SPARK-10407) Possible Stack-overflow using InheritableThreadLocal nested-properties for SparkContext.localProperties

2015-09-01 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-10407:
--

 Summary: Possible Stack-overflow using InheritableThreadLocal 
nested-properties for SparkContext.localProperties
 Key: SPARK-10407
 URL: https://issues.apache.org/jira/browse/SPARK-10407
 Project: Spark
  Issue Type: Bug
Reporter: Matt Cheah
Priority: Minor


In my long-running web server that eventually uses a SparkContext, I eventually 
came across some stack overflow errors that could only be cleared by restarting 
my server.

{code}
java.lang.StackOverflowError: null
at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
 ~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) 
~[na:1.7.0_45]
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) 
~[na:1.7.0_45]
...
...
 at 
org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) 
~[commons-lang3-3.3.jar:3.3]
at 
org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) 
~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1]
...
{code}

The bottom of the trace indicates that serializing a properties object is part 
of the stack when the overflow happens. I checked the origin of the properties, 
and it turns out it's coming from SparkContext.localProperties, an 
InheritableThreadLocal field.

When I debugged further, I found that localProperties.childValue() wraps its 
parent properties object in another properties object and returns the wrapper 
properties. The problem is that every time childValue was called, the 
properties object passed in from the parent had a deeper and deeper nesting of 
wrapped properties. This doesn't make any sense since my application doesn't 
create threads recursively or anything like that, so I'm marking this issue as 
a minor one since it shouldn't affect the average application.

On the other hand, there shouldn't really be any reason to create the 
properties in childValue using nesting. Instead, the properties returned by 
childValue should be flattened and, more importantly, a deep copy of the parent.
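
To make that concrete, here is a minimal sketch of the shape of the proposed fix 
(an assumption about the fix, not the actual Spark patch): childValue returns a 
flattened deep copy of the parent's properties instead of wrapping them.

{code}
import java.util.Properties
import scala.collection.JavaConverters._

val localProperties = new InheritableThreadLocal[Properties] {
  override def childValue(parent: Properties): Properties = {
    // Flatten: copy every visible key/value (including inherited defaults)
    // into a fresh Properties object instead of using `new Properties(parent)`,
    // so no chain of nested defaults can build up across thread creations.
    val child = new Properties()
    parent.stringPropertyNames().asScala.foreach { k =>
      child.setProperty(k, parent.getProperty(k))
    }
    child
  }
  override def initialValue(): Properties = new Properties()
}
{code}

Spark's actual fix may differ; the point is just that the child's snapshot should 
not retain a reference chain back to the parent's Properties object.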



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10406) Document spark on yarn distributed cache symlink functionality

2015-09-01 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10406:
-

 Summary: Document spark on yarn distributed cache symlink 
functionality
 Key: SPARK-10406
 URL: https://issues.apache.org/jira/browse/SPARK-10406
 Project: Spark
  Issue Type: Bug
  Components: Documentation, YARN
Affects Versions: 1.5.0
Reporter: Thomas Graves


Spark on YARN supports using the distributed cache via --files, --jars, and 
--archives.  It also supports specifying a name for those via #, 
i.e. foo.tgz#myname.

myname is what foo.tgz is unarchived as, and it shows up in the local directory of 
the application.  The same applies to files and jars.

We should document this support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726296#comment-14726296
 ] 

Sean Owen commented on SPARK-10405:
---

This is fairly easy already with foldByKey and a priority queue -- does it 
really need its own API method?
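
To sketch that suggestion (illustrative only -- the key/value types and helper 
name are made up): keep at most k values per key while merging, here with a 
sorted list capped at k per key; a bounded priority queue is the same idea with 
better constants for large k.

{code}
import org.apache.spark.rdd.RDD

// Keep the k largest values per key without pulling whole groups into memory.
def topKPerKey(pairs: RDD[(String, Int)], k: Int): RDD[(String, List[Int])] = {
  pairs
    .mapValues(List(_))                                // seed: one-element list
    .foldByKey(List.empty[Int]) { (a, b) =>
      (a ++ b).sorted(Ordering[Int].reverse).take(k)   // merge, keep the k largest
    }
}

// Usage sketch:
//   topKPerKey(sc.parallelize(Seq(("a", 1), ("a", 5), ("a", 3), ("b", 2))), 2).collect()
//   // e.g. Array((a, List(5, 3)), (b, List(2)))
{code}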

> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ashish shenoy updated SPARK-10405:
--
Description: 
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
values from a given RDD. 

It'd be good to have an API that returned the "top" values per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.

  was:
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
items from a given RDD. 

It'd be good to have an API that returned the "top" values per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> values from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ashish shenoy updated SPARK-10405:
--
Description: 
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
items from a given RDD. 

It'd be good to have an API that returned the "top" values per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.

  was:
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
values from a given RDD. 

It'd be good to have an API that returned the "top" values per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ashish shenoy updated SPARK-10405:
--
Description: 
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
items from a given RDD. 

It'd be good to have an API that returned the "top" values per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.

  was:
Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
items from a given RDD. 

It'd be good to have an API that returned the "top" items per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.


> Support takeOrdered and topK values per key
> ---
>
> Key: SPARK-10405
> URL: https://issues.apache.org/jira/browse/SPARK-10405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: ashish shenoy
>  Labels: features, newbie
>
> Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
> items from a given RDD. 
> It'd be good to have an API that returned the "top" values per key for a 
> keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the 
> task is to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10405) Support takeOrdered and topK values per key

2015-09-01 Thread ashish shenoy (JIRA)
ashish shenoy created SPARK-10405:
-

 Summary: Support takeOrdered and topK values per key
 Key: SPARK-10405
 URL: https://issues.apache.org/jira/browse/SPARK-10405
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: ashish shenoy


Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" 
items from a given RDD. 

It'd be good to have an API that returned the "top" items per key for a keyed 
RDD i.e. RDDpair. Such an API would be very useful for cases where the task is 
to only display an ordered subset of the input data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10392:
---
Fix Version/s: 1.5.1

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
> Fix For: 1.6.0, 1.5.1
>
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read the data - the date '1970-01-01' is converted to an int. This 
> makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10392.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8556
[https://github.com/apache/spark/pull/8556]

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
> Fix For: 1.6.0
>
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I'm trying to read the data - the date '1970-01-01' is converted to an int. This 
> makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10162.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8555
[https://github.com/apache/spark/pull/8555]

> PySpark filters with datetimes mess up when datetimes have timezones.
> -
>
> Key: SPARK-10162
> URL: https://issues.apache.org/jira/browse/SPARK-10162
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Kevin Cox
> Fix For: 1.6.0
>
>
> PySpark appears to ignore timezone information when filtering on (and working 
> in general with) datetimes.
> Please see the example below. The generated filter in the query plan is 5 
> hours off (my computer is EST).
> {code}
> In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", 
> TimestampType())]))
> In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain()
> Filter (dt#9 > 9467028)
>  Scan PhysicalRDD[dt#9]
> {code}
> Note that 9467028 == Sat  1 Jan 2000 05:00:00 UTC



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9516) Improve Thread Dump page

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9516:
-
Target Version/s: 1.6.0

> Improve Thread Dump page
> 
>
> Key: SPARK-9516
> URL: https://issues.apache.org/jira/browse/SPARK-9516
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Nan Zhu
>
> Originally proposed by [~irashid] in 
> https://github.com/apache/spark/pull/7808#issuecomment-126788335:
> we can enhance the current thread dump page with at least the following two 
> new features:
> 1) sort threads by thread status, 
> 2) a filter to grep the threads



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9516) Improve Thread Dump page

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9516:
-
Assignee: Nan Zhu

> Improve Thread Dump page
> 
>
> Key: SPARK-9516
> URL: https://issues.apache.org/jira/browse/SPARK-9516
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>
> Originally proposed by [~irashid] in 
> https://github.com/apache/spark/pull/7808#issuecomment-126788335:
> we can enhance the current thread dump page with at least the following two 
> new features:
> 1) sort threads by thread status, 
> 2) a filter to grep the threads



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9769:
---

Assignee: Apache Spark

> Add Python API for ml.feature.CountVectorizerModel
> --
>
> Key: SPARK-9769
> URL: https://issues.apache.org/jira/browse/SPARK-9769
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add Python API, user guide and example for ml.feature.CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel

2015-09-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9769:
---

Assignee: (was: Apache Spark)

> Add Python API for ml.feature.CountVectorizerModel
> --
>
> Key: SPARK-9769
> URL: https://issues.apache.org/jira/browse/SPARK-9769
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add Python API, user guide and example for ml.feature.CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel

2015-09-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726265#comment-14726265
 ] 

Apache Spark commented on SPARK-9769:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8561

> Add Python API for ml.feature.CountVectorizerModel
> --
>
> Key: SPARK-9769
> URL: https://issues.apache.org/jira/browse/SPARK-9769
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add Python API, user guide and example for ml.feature.CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10404) Worker should terminate previous executor before launch new one

2015-09-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10404:
--

 Summary: Worker should terminate previous executor before launch 
new one
 Key: SPARK-10404
 URL: https://issues.apache.org/jira/browse/SPARK-10404
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


Reported here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hung-spark-executors-don-t-count-toward-worker-memory-limit-td16083.html#a24548

If a newly launched executor overlaps with a previous one that is still running, 
the machine could run out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4223) Support * (meaning all users) as part of the acls

2015-09-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4223.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Support * (meaning all users) as part of the acls
> -
>
> Key: SPARK-4223
> URL: https://issues.apache.org/jira/browse/SPARK-4223
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
> Fix For: 1.6.0
>
>
> Currently we support setting view and modify acls but you have to specify a 
> list of users.  It would be nice to support * meaning all users have access.
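
A brief sketch (not from the JIRA itself) of how the wildcard could be used once this lands, e.g. in spark-shell or an application's setup. The config keys below are the existing acl properties; the "*" value is the new behavior this issue adds.

{code}
import org.apache.spark.SparkConf

// Let every user view the UI and modify (e.g. kill) the application.
val conf = new SparkConf()
  .setAppName("acl-wildcard-example")
  .set("spark.ui.view.acls", "*")
  .set("spark.modify.acls", "*")
{code}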



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5269:
-
Target Version/s: 1.6.0

> BlockManager.dataDeserialize always creates a new serializer instance
> -
>
> Key: SPARK-5269
> URL: https://issues.apache.org/jira/browse/SPARK-5269
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ivan Vergiliev
>Assignee: Matt Cheah
>  Labels: performance, serializers
>
> BlockManager.dataDeserialize always creates a new instance of the serializer, 
> which is pretty slow in some cases. I'm using Kryo serialization and have a 
> custom registrator, and its register method is showing up as taking about 15% 
> of the execution time in my profiles. This started happening after I 
> increased the number of keys in a job with a shuffle phase by a factor of 40.
> One solution I can think of is to create a ThreadLocal SerializerInstance for 
> the defaultSerializer, and only create a new one if a custom serializer is 
> passed in. AFAICT a custom serializer is passed only from 
> DiskStore.getValues, and that, on the other hand, depends on the serializer 
> passed to ExternalSorter. I don't know how often this is used, but I think 
> this can still be a good solution for the standard use case.
> Oh, and also - ExternalSorter already has a SerializerInstance, so if the 
> getValues method is called from a single thread, maybe we can pass that 
> directly?
> I'd be happy to try a patch but would probably need a confirmation from 
> someone that this approach would indeed work (or an idea for another).
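
A minimal sketch of the ThreadLocal idea described above, not Spark's actual BlockManager code: cache one SerializerInstance per thread for the default serializer and only create a fresh instance when a custom serializer is explicitly passed in. The class and method names are illustrative only.

{code}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

class CachedSerializerProvider(defaultSerializer: Serializer) {
  // One SerializerInstance per thread for the default serializer.
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
  }

  // A custom serializer (e.g. the one passed in from DiskStore.getValues) still gets a fresh instance.
  def instanceFor(custom: Option[Serializer]): SerializerInstance = custom match {
    case Some(s) => s.newInstance()
    case None    => cached.get()
  }
}
{code}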



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10081:
--
Target Version/s: 1.6.0

> Skip re-computing getMissingParentStages in DAGScheduler
> 
>
> Key: SPARK-10081
> URL: https://issues.apache.org/jira/browse/SPARK-10081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> In DAGScheduler, we can skip re-computing getMissingParentStages when calling 
> submitStage in handleJobSubmitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10081:
--
Issue Type: Improvement  (was: Bug)

> Skip re-computing getMissingParentStages in DAGScheduler
> 
>
> Key: SPARK-10081
> URL: https://issues.apache.org/jira/browse/SPARK-10081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> In DAGScheduler, we can skip re-computing getMissingParentStages when calling 
> submitStage in handleJobSubmitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10247:
--
Component/s: (was: Spark Core)
 Tests
 Scheduler

> Cleanup DAGSchedulerSuite "ignore late map task completion"
> ---
>
> Key: SPARK-10247
> URL: https://issues.apache.org/jira/browse/SPARK-10247
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit 
> confusing, we can add a few asserts & comments to clarify a little



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10247:
--
Target Version/s: 1.6.0

> Cleanup DAGSchedulerSuite "ignore late map task completion"
> ---
>
> Key: SPARK-10247
> URL: https://issues.apache.org/jira/browse/SPARK-10247
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Trivial
>
> the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit 
> confusing, we can add a few asserts & comments to clarify a little



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10247:
--
Priority: Trivial  (was: Minor)

> Cleanup DAGSchedulerSuite "ignore late map task completion"
> ---
>
> Key: SPARK-10247
> URL: https://issues.apache.org/jira/browse/SPARK-10247
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Trivial
>
> the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit 
> confusing, we can add a few asserts & comments to clarify a little



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-01 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726076#comment-14726076
 ] 

Cody Koeninger commented on SPARK-10320:


You would supply a function, similar to the way createDirectStream currently 
takes a messageHandler: MessageAndMetadata[K, V] => R

The type of that function would be 

(Time, Map[TopicAndPartition, Long], Map[TopicAndPartition, LeaderOffset]) => 
(Map[TopicAndPartition, Long], Map[TopicAndPartition, LeaderOffset])

in other words

(time, fromOffsets, untilOffsets) => (fromOffsets, untilOffsets)

Your function would be called in the compute() method of the dstream, after 
contacting the leaders and before making the rdd for the next batch.
That would let you make arbitrary modifications to the topics / partitions / 
offsets.

As for the desire for a general solution, I think this is a Kafka-specific 
concern.  Not all streams have topics.
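
To make the hook described above concrete, here is a hypothetical sketch; nothing like this exists in the current API, and all names are illustrative only. LeaderOffset is private to Spark's Kafka integration, so plain Longs stand in for it here.

{code}
import kafka.common.TopicAndPartition
import org.apache.spark.streaming.Time

object OffsetsHookSketch {
  type Offsets = Map[TopicAndPartition, Long]

  // (batch time, planned fromOffsets, planned untilOffsets) => (adjusted fromOffsets, adjusted untilOffsets)
  type OffsetsHook = (Time, Offsets, Offsets) => (Offsets, Offsets)

  // Example: stop consuming a topic that is no longer needed, without restarting the context.
  val dropRetiredTopic: OffsetsHook = (time, from, until) => {
    def keep(tp: TopicAndPartition): Boolean = tp.topic != "retired_topic"
    (from.filter { case (tp, _) => keep(tp) }, until.filter { case (tp, _) => keep(tp) })
  }
}
{code}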

> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> to current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe to or unsubscribe from the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn

2015-09-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725955#comment-14725955
 ] 

Marcelo Vanzin commented on SPARK-10288:


So can that instead be used as the reasoning in the design document? It talks 
about standalone and mesos having rest servers as if that by itself is a reason 
to have support for rest. The PR also talks about how now "YARN also supports 
this function", but since the backends are completely different in all cases, 
it makes no sense to mention standalone or mesos here.

> Add a rest client for Spark on Yarn
> ---
>
> Key: SPARK-10288
> URL: https://issues.apache.org/jira/browse/SPARK-10288
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> This is a proposal to add a REST client for Spark on YARN. Spark standalone and 
> Mesos modes already support submitting applications over REST, while Spark on 
> YARN still submits them programmatically. Since the YARN ResourceManager (from 
> Hadoop 2.6) also supports REST submission, it would be good for Spark on YARN 
> to support it as well.
> Here is the design doc 
> (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).
> Currently I'm working on it, working branch is 
> (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
> part is already finished.
> Any comment is greatly appreciated, thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-09-01 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725951#comment-14725951
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

"it's almost certainly not the same thread".
Yes, you're right. The new topic additions would happen in a different thread 
than the one that initialized the spark context and started the streaming 
context.

Could you describe how the map of topic-partition and consumption offsets would 
be supplied? As an additional argument to createDirectStream() (callable even 
after the streaming context is started?) Perhaps a more complete sketch of the 
possible solution (even from just an end user API perspective) would help. 
Also, while we're looking to solve this problem in the context of Kafka, it'd 
be better to generalize the solution to all sorts of channels over which data 
can stream.

> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> to current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe to or unsubscribe from the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-01 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725935#comment-14725935
 ] 

Meihua Wu commented on SPARK-8518:
--

For the reference implementations, recommend we consider this R function: 
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html 



> Log-linear models for survival analysis
> ---
>
> Key: SPARK-8518
> URL: https://issues.apache.org/jira/browse/SPARK-8518
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to add basic log-linear models for survival analysis. The 
> implementation should match the result from R's survival package 
> (http://cran.r-project.org/web/packages/survival/index.html).
> Design doc from [~yanboliang]: 
> https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10379:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> UnsafeShuffleExternalSorter should preserve first page
> --
>
> Key: SPARK-10379
> URL: https://issues.apache.org/jira/browse/SPARK-10379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID 
> 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable 
> to acquire 67108864 bytes of memory
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10403:
--

 Summary: UnsafeRowSerializer can't work with UnsafeShuffleManager 
(tungsten-sort)
 Key: SPARK-10403
 URL: https://issues.apache.org/jira/browse/SPARK-10403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu


UnsafeRowSerializer relies on an EOF marker in the stream, but UnsafeRowWriter does 
not write an EOF marker between partitions.

{code}
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{code}
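
A generic, self-contained illustration of the framing mismatch described above (this is not Spark's serializer code): a reader that loops on readInt() needs the writer to emit an explicit end-of-stream marker; otherwise it fails with EOFException at a partition boundary, as in the trace above.

{code}
import java.io.{DataInputStream, DataOutputStream}

object EofFramingSketch {
  val EndOfStream: Int = -1

  // Writer side: length-prefixed records followed by an explicit end-of-stream marker.
  def writeRecords(out: DataOutputStream, records: Seq[Array[Byte]]): Unit = {
    records.foreach { r => out.writeInt(r.length); out.write(r) }
    out.writeInt(EndOfStream)
  }

  // Reader side: loop on readInt() until the marker is seen.
  def readRecords(in: DataInputStream): Vector[Array[Byte]] = {
    Iterator.continually(in.readInt())
      .takeWhile(_ != EndOfStream)
      .map { len => val buf = new Array[Byte](len); in.readFully(buf); buf }
      .toVector
  }
}
{code}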



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10394:
--
 Assignee: Yanbo Liang
Affects Version/s: 1.5.0
 Target Version/s: 1.6.0

> Make GBTParams use shared "stepSize"
> 
>
> Key: SPARK-10394
> URL: https://issues.apache.org/jira/browse/SPARK-10394
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> GBTParams currently defines "stepSize" as its learning rate.
> ML already has the shared param trait "HasStepSize"; GBTParams can extend it 
> rather than duplicating the implementation.
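
A minimal sketch of the shape of this change. The shared param traits are package-private, so code like this would live inside Spark's org.apache.spark.ml package tree, and GBTParamsSketch below is a simplified stand-in for the real GBTParams.

{code}
package org.apache.spark.ml.tree

import org.apache.spark.ml.param.Params
import org.apache.spark.ml.param.shared.HasStepSize

// The learning rate is now the shared "stepSize" param instead of a duplicated local definition.
private[ml] trait GBTParamsSketch extends Params with HasStepSize {
  final def getLearningRate: Double = getStepSize
}
{code}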



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn

2015-09-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725913#comment-14725913
 ] 

Steve Loughran commented on SPARK-10288:


Long Haul job submission. You can't currently submit work to a running cluster 
if the RPC channel isn't open to you, which in cloud environments means "ssh 
tunnel fun" or "somehow get into the cluster"



> Add a rest client for Spark on Yarn
> ---
>
> Key: SPARK-10288
> URL: https://issues.apache.org/jira/browse/SPARK-10288
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> This is a proposal to add a REST client for Spark on YARN. Spark standalone and 
> Mesos modes already support submitting applications over REST, while Spark on 
> YARN still submits them programmatically. Since the YARN ResourceManager (from 
> Hadoop 2.6) also supports REST submission, it would be good for Spark on YARN 
> to support it as well.
> Here is the design doc 
> (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).
> Currently I'm working on it, working branch is 
> (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
> part is already finished.
> Any comment is greatly appreciated, thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of Python API is to have feature parity with Scala/Java API. You 
can find a complete list 
[here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall 
into two major categories:

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
* automatically test example code

[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-09-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725911#comment-14725911
 ] 

Sean Owen commented on SPARK-10375:
---

I don't think this is a problem in the sense that you would not be setting 
spark.driver props in your program anyway, kind of by definition. "Fixing" it 
just to emit a warning entails tracking the source of properties, whether it 
was set in one place, overridden elsewhere, then maintaining some blacklist of 
properties, etc.

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in yarn shows that the driver has 1g, however, the 
> Executors tab only shows 512 M (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified via the command line 
> (i.e. --driver-memory 1g)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-09-01 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725909#comment-14725909
 ] 

Alex Rovner commented on SPARK-10375:
-

[~srowen] Shall we re-open?

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in yarn shows that the driver has 1g, however, the 
> Executors tab only shows 512 M (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified via the command line 
> (i.e. --driver-memory 1g)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9043:
-
Target Version/s: 1.6.0

> Serialize key, value and combiner classes in ShuffleDependency
> --
>
> Key: SPARK-9043
> URL: https://issues.apache.org/jira/browse/SPARK-9043
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Matt Massie
>
> ShuffleManager implementations are currently not given type information 
> regarding the key, value and combiner classes. Serialization of shuffle 
> objects relies on them being JavaSerializable, with methods defined for 
> reading/writing the object or, alternatively, serialization via Kryo which 
> uses reflection.
> Serialization systems like Avro, Thrift and Protobuf generate classes with 
> zero argument constructors and explicit schema information (e.g. 
> IndexedRecords in Avro have get, put and getSchema methods).
> By serializing the key, value and combiner class names in ShuffleDependency, 
> shuffle implementations will have access to schema information when 
> registerShuffle() is called.
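
A hypothetical illustration of the idea above, not Spark's actual ShuffleDependency: capture the key/value/combiner class names from ClassTags so that a shuffle implementation could recover schema information (e.g. Avro schemas) when registerShuffle() is called. All names here are illustrative.

{code}
import scala.reflect.ClassTag

// Container for the class names a shuffle implementation could inspect at registration time.
case class ShuffleClassInfo(
    keyClassName: String,
    valueClassName: String,
    combinerClassName: Option[String])

object ShuffleClassInfo {
  def capture[K: ClassTag, V: ClassTag, C: ClassTag](hasCombiner: Boolean): ShuffleClassInfo =
    ShuffleClassInfo(
      implicitly[ClassTag[K]].runtimeClass.getName,
      implicitly[ClassTag[V]].runtimeClass.getName,
      if (hasCombiner) Some(implicitly[ClassTag[C]].runtimeClass.getName) else None)
}
{code}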



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9043:
-
Assignee: Matt Massie

> Serialize key, value and combiner classes in ShuffleDependency
> --
>
> Key: SPARK-9043
> URL: https://issues.apache.org/jira/browse/SPARK-9043
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Matt Massie
>Assignee: Matt Massie
>
> ShuffleManager implementations are currently not given type information 
> regarding the key, value and combiner classes. Serialization of shuffle 
> objects relies on them being JavaSerializable, with methods defined for 
> reading/writing the object or, alternatively, serialization via Kryo which 
> uses reflection.
> Serialization systems like Avro, Thrift and Protobuf generate classes with 
> zero argument constructors and explicit schema information (e.g. 
> IndexedRecords in Avro have get, put and getSchema methods).
> By serializing the key, value and combiner class names in ShuffleDependency, 
> shuffle implementations will have access to schema information when 
> registerShuffle() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts

2015-09-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10398.
---
   Resolution: Fixed
Fix Version/s: (was: 1.5.0)
   1.5.1
   1.6.0

Issue resolved by pull request 8557
[https://github.com/apache/spark/pull/8557]

> Migrate Spark download page to use new lua mirroring scripts
> 
>
> Key: SPARK-10398
> URL: https://issues.apache.org/jira/browse/SPARK-10398
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Luciano Resende
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
> Attachments: SPARK-10398
>
>
> From infra team :
> If you refer to www.apache.org/dyn/closer.cgi, please refer to
> www.apache.org/dyn/closer.lua instead from now on.
> Any non-conforming CGI scripts are no longer enabled, and are all
> rewritten to go to our new mirror system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9043:
-
Component/s: (was: Spark Core)
 Shuffle

> Serialize key, value and combiner classes in ShuffleDependency
> --
>
> Key: SPARK-9043
> URL: https://issues.apache.org/jira/browse/SPARK-9043
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Matt Massie
>
> ShuffleManager implementations are currently not given type information 
> regarding the key, value and combiner classes. Serialization of shuffle 
> objects relies on them being JavaSerializable, with methods defined for 
> reading/writing the object or, alternatively, serialization via Kryo which 
> uses reflection.
> Serialization systems like Avro, Thrift and Protobuf generate classes with 
> zero argument constructors and explicit schema information (e.g. 
> IndexedRecords in Avro have get, put and getSchema methods).
> By serializing the key, value and combiner class names in ShuffleDependency, 
> shuffle implementations will have access to schema information when 
> registerShuffle() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-09-01 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-10370:
-
Component/s: (was: Spark Core)
 Scheduler

> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>
> Follow-up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages, before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks, that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all tasks sets for a stage as zombie as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least) which isn't easily testable with the current setup.
> To be clear, this is *not* just referring to canceling running tasks (which 
> may be taken care of by SPARK-2666).  This is to make sure that the taskset 
> is marked as a zombie, to prevent submitting *new* tasks from this task set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10296) add preservesParitioning parameter to RDD.map

2015-09-01 Thread Esteban Donato (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725878#comment-14725878
 ] 

Esteban Donato commented on SPARK-10296:


any further thought on this issue? Do you think it deserves a pull request with 
the enhancement?

> add preservesParitioning parameter to RDD.map
> -
>
> Key: SPARK-10296
> URL: https://issues.apache.org/jira/browse/SPARK-10296
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Esteban Donato
>Priority: Minor
>
> It would be nice to add the Boolean parameter preservesParitioning with 
> default false to RDD.map method just as it is in RDD.mapPartitions method.
> If you agree I can submit a pull request with this enhancement.
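
Until such a parameter exists, a common workaround is mapPartitions with preservesPartitioning = true -- a minimal sketch, assuming the function does not change keys (otherwise keeping the partitioner would be incorrect). For value-only functions, mapValues already preserves the partitioner.

{code}
import org.apache.spark.{HashPartitioner, SparkContext}

object PreservePartitioningSketch {
  def upperCaseValues(sc: SparkContext): Unit = {
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))

    // Equivalent to pairs.map { case (k, v) => (k, v.toUpperCase) }, but keeps the partitioner.
    val upper = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v.toUpperCase) },
      preservesPartitioning = true)

    assert(upper.partitioner == pairs.partitioner)
  }
}
{code}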



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10372:
--
Target Version/s: 1.6.0

> Add end-to-end tests for the scheduling code
> 
>
> Key: SPARK-10372
> URL: https://issues.apache.org/jira/browse/SPARK-10372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Tests
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> The current testing framework for the scheduler only tests individual classes 
> in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
> course that is useful, but we are missing tests which cover the interaction 
> between these components.  We also have larger tests which run entire spark 
> jobs, but that doesn't allow fine grained control of failures for verifying 
> spark's fault-tolerance.
> Adding a framework for testing the scheduler as a whole will:
> 1. Allow testing bugs which involve the interaction between multiple parts of 
> the scheduler, eg. SPARK-10370
> 2. Greater confidence in refactoring the scheduler as a whole.  Given the 
> tight coordination between the components, it's hard to consider any 
> refactoring, since it would be unlikely to be covered by any tests.
> 3. Make it easier to increase test coverage.  Writing tests for the 
> {{DAGScheduler}} now requires intimate knowledge of exactly how the 
> components fit together -- a lot of work goes into mimicking the appropriate 
> behavior of the other components.  Furthermore, it makes the tests harder to 
> understand for the un-initiated -- which parts are simulating some condition 
> of an external system (eg., losing an executor), and which parts are just 
> interaction with other parts of the scheduler (eg., task resubmission)?  
> These tests will let us work just at the level of the interaction with the 
> executors -- tasks complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10372:
--
Component/s: (was: Spark Core)
 Tests
 Scheduler

> Add end-to-end tests for the scheduling code
> 
>
> Key: SPARK-10372
> URL: https://issues.apache.org/jira/browse/SPARK-10372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Tests
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> The current testing framework for the scheduler only tests individual classes 
> in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
> course that is useful, but we are missing tests which cover the interaction 
> between these components.  We also have larger tests which run entire spark 
> jobs, but that doesn't allow fine grained control of failures for verifying 
> spark's fault-tolerance.
> Adding a framework for testing the scheduler as a whole will:
> 1. Allow testing bugs which involve the interaction between multiple parts of 
> the scheduler, eg. SPARK-10370
> 2. Greater confidence in refactoring the scheduler as a whole.  Given the 
> tight coordination between the components, it's hard to consider any 
> refactoring, since it would be unlikely to be covered by any tests.
> 3. Make it easier to increase test coverage.  Writing tests for the 
> {{DAGScheduler}} now requires intimate knowledge of exactly how the 
> components fit together -- a lot of work goes into mimicking the appropriate 
> behavior of the other components.  Furthermore, it makes the tests harder to 
> understand for the un-initiated -- which parts are simulating some condition 
> of an external system (eg., losing an executor), and which parts are just 
> interaction with other parts of the scheduler (eg., task resubmission)?  
> These tests will let us work just at the level of the interaction with the 
> executors -- tasks complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10192) Test for fetch failure in a shared dependency for "skipped" stages

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10192:
--
Component/s: (was: Spark Core)
 Scheduler

> Test for fetch failure in a shared dependency for "skipped" stages
> --
>
> Key: SPARK-10192
> URL: https://issues.apache.org/jira/browse/SPARK-10192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Tests
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> One confusing corner case of the DAGScheduler is when there is a shared 
> shuffle dependency, a job might "skip" the stage associated with that shuffle 
> dependency, since it's already been created as part of a different stage.  
> This means if there is a fetch failure, the retry will technically happen as 
> part of a different {{Stage}} instance.
> This already works, but is lacking tests, so I just plan on adding a simple 
> test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10372:
--
Fix Version/s: (was: 1.6.0)

> Add end-to-end tests for the scheduling code
> 
>
> Key: SPARK-10372
> URL: https://issues.apache.org/jira/browse/SPARK-10372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler, Tests
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> The current testing framework for the scheduler only tests individual classes 
> in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
> course that is useful, but we are missing tests which cover the interaction 
> between these components.  We also have larger tests which run entire spark 
> jobs, but that doesn't allow fine grained control of failures for verifying 
> spark's fault-tolerance.
> Adding a framework for testing the scheduler as a whole will:
> 1. Allow testing bugs which involve the interaction between multiple parts of 
> the scheduler, eg. SPARK-10370
> 2. Greater confidence in refactoring the scheduler as a whole.  Given the 
> tight coordination between the components, it's hard to consider any 
> refactoring, since it would be unlikely to be covered by any tests.
> 3. Make it easier to increase test coverage.  Writing tests for the 
> {{DAGScheduler}} now requires intimate knowledge of exactly how the 
> components fit together -- a lot of work goes into mimicking the appropriate 
> behavior of the other components.  Furthermore, it makes the tests harder to 
> understand for the un-initiated -- which parts are simulating some condition 
> of an external system (eg., losing an executor), and which parts are just 
> interaction with other parts of the scheduler (eg., task resubmission)?  
> These tests will let us work just at the level of the interaction with the 
> executors -- tasks complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduler

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10372:
--
Summary: Add end-to-end tests for the scheduler  (was: Tests for entire 
scheduler)

> Add end-to-end tests for the scheduler
> --
>
> Key: SPARK-10372
> URL: https://issues.apache.org/jira/browse/SPARK-10372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 1.6.0
>
>
> The current testing framework for the scheduler only tests individual classes 
> in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
> course that is useful, but we are missing tests which cover the interaction 
> between these components.  We also have larger tests which run entire spark 
> jobs, but that doesn't allow fine grained control of failures for verifying 
> spark's fault-tolerance.
> Adding a framework for testing the scheduler as a whole will:
> 1. Allow testing bugs which involve the interaction between multiple parts of 
> the scheduler, eg. SPARK-10370
> 2. Greater confidence in refactoring the scheduler as a whole.  Given the 
> tight coordination between the components, it's hard to consider any 
> refactoring, since it would be unlikely to be covered by any tests.
> 3. Make it easier to increase test coverage.  Writing tests for the 
> {{DAGScheduler}} now requires intimate knowledge of exactly how the 
> components fit together -- a lot of work goes into mimicking the appropriate 
> behavior of the other components.  Furthermore, it makes the tests harder to 
> understand for the un-initiated -- which parts are simulating some condition 
> of an external system (eg., losing an executor), and which parts are just 
> interaction with other parts of the scheduler (eg., task resubmission)?  
> These tests will let us work just at the level of the interaction with the 
> executors -- tasks complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10192) Test for fetch failure in a shared dependency for "skipped" stages

2015-09-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10192:
--
Issue Type: Sub-task  (was: Test)
Parent: SPARK-8987

> Test for fetch failure in a shared dependency for "skipped" stages
> --
>
> Key: SPARK-10192
> URL: https://issues.apache.org/jira/browse/SPARK-10192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> One confusing corner case of the DAGScheduler is when there is a shared 
> shuffle dependency, a job might "skip" the stage associated with that shuffle 
> dependency, since it's already been created as part of a different stage.  
> This means if there is a fetch failure, the retry will technically happen as 
> part of a different {{Stage}} instance.
> This already works, but is lacking tests, so I just plan on adding a simple 
> test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


