[jira] [Created] (SPARK-10414) DenseMatrix gives different hashcode even though equals returns true
Vinod KC created SPARK-10414: Summary: DenseMatrix gives different hashcode even though equals returns true Key: SPARK-10414 URL: https://issues.apache.org/jira/browse/SPARK-10414 Project: Spark Issue Type: Bug Components: MLlib Reporter: Vinod KC Priority: Minor The hashCode implementation in DenseMatrix gives different results for the same input:

val dm = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = Matrices.dense(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm1 === dm) // passes
assert(dm1.hashCode === dm.hashCode) // fails

This violates the hashCode/equals contract. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
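The contract requires that objects comparing equal produce equal hash codes. A minimal sketch of the fix idea, assuming nothing about the actual Spark patch (SimpleDenseMatrix is a toy stand-in, not Spark's DenseMatrix): derive hashCode from exactly the fields that equals compares.

```scala
// Toy sketch (not the actual Spark patch): a matrix-like class whose
// hashCode is derived from exactly the fields that equals compares,
// which restores the hashCode/equals contract the report describes.
class SimpleDenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: SimpleDenseMatrix =>
      numRows == m.numRows && numCols == m.numCols &&
        java.util.Arrays.equals(values, m.values)
    case _ => false
  }

  // Mix the same fields into the hash, so matrices that compare equal
  // always hash equally.
  override def hashCode: Int = {
    var result = numRows
    result = 31 * result + numCols
    result = 31 * result + java.util.Arrays.hashCode(values)
    result
  }
}

val dm  = new SimpleDenseMatrix(2, 2, Array(0.0, 1.0, 2.0, 3.0))
val dm1 = new SimpleDenseMatrix(2, 2, Array(0.0, 1.0, 2.0, 3.0))
assert(dm == dm1)                   // equals holds
assert(dm.hashCode == dm1.hashCode) // and now hashCode agrees
```

Note the use of `java.util.Arrays.hashCode` rather than `Array#hashCode`: JVM arrays hash by identity, which is the usual source of this class of bug.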
[jira] [Assigned] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data
[ https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9718: --- Assignee: Apache Spark > LinearRegressionTrainingSummary should hold all columns in transformed data > --- > > Key: SPARK-9718 > URL: https://issues.apache.org/jira/browse/SPARK-9718 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > LinearRegression training summary: The transformed dataset should hold all > columns, not just selected ones like prediction and label. There is no real > need to remove some, and the user may find them useful.
[jira] [Commented] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data
[ https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726845#comment-14726845 ] Apache Spark commented on SPARK-9718: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8564 > LinearRegressionTrainingSummary should hold all columns in transformed data > --- > > Key: SPARK-9718 > URL: https://issues.apache.org/jira/browse/SPARK-9718 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > LinearRegression training summary: The transformed dataset should hold all > columns, not just selected ones like prediction and label. There is no real > need to remove some, and the user may find them useful.
[jira] [Assigned] (SPARK-9718) LinearRegressionTrainingSummary should hold all columns in transformed data
[ https://issues.apache.org/jira/browse/SPARK-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9718: --- Assignee: (was: Apache Spark) > LinearRegressionTrainingSummary should hold all columns in transformed data > --- > > Key: SPARK-9718 > URL: https://issues.apache.org/jira/browse/SPARK-9718 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > LinearRegression training summary: The transformed dataset should hold all > columns, not just selected ones like prediction and label. There is no real > need to remove some, and the user may find them useful.
[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml DecisionTree*
[ https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726844#comment-14726844 ] holdenk commented on SPARK-9722: I can do this if no one else is working on it :) > Pass random seed to spark.ml DecisionTree* > -- > > Key: SPARK-9722 > URL: https://issues.apache.org/jira/browse/SPARK-9722 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > > Trees use XORShiftRandom when binning continuous features. Currently, they > use a fixed seed of 1. They should accept a random seed param and use that > instead.
[jira] [Updated] (SPARK-10288) Add a rest client for Spark on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-10288: Description: This is a proposal to add a REST client for Spark on YARN. A REST API is a convenient addition that lets users submit applications through a REST client; people can easily achieve long-haul submission or build their own submission gateway on top of it. Here is the design doc (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). Currently I'm working on it; the working branch is (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), and the major part is already finished. Any comment is greatly appreciated, thanks a lot. was: This is a proposal to add rest client for Spark on Yarn. Currently Spark standalone and Mesos mode can support rest way of submitting applications, for Spark on Yarn, it still uses program way to do it. Since RM now (from 2.6) supports rest way of submitting application, so it would be better Spark on Yarn also support this way. Here is the design doc (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). Currently I'm working on it, working branch is (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major part is already finished. Any comment is greatly appreciated, thanks a lot. > Add a rest client for Spark on Yarn > --- > > Key: SPARK-10288 > URL: https://issues.apache.org/jira/browse/SPARK-10288 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > This is a proposal to add a REST client for Spark on YARN. A REST API is a > convenient addition that lets users submit applications through a REST > client; people can easily achieve long-haul submission or build their own > submission gateway on top of it. 
> Here is the design doc > (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). > Currently I'm working on it, working branch is > (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major > part is already finished. > Any comment is greatly appreciated, thanks a lot.
[jira] [Resolved] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until page loaded to avoid to freeze the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-8694. --- Resolution: Won't Fix > Defer executing drawTaskAssignmentTimeline until page loaded to avoid to > freeze the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive tasks in the stage page (such as, running > sc.parallelize(1 to 10, 1).count()), Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until page loaded to avoid freezing the page. So > that the user can view the page while rendering Event Timeline in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking loading page.
[jira] [Commented] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until page loaded to avoid to freeze the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726815#comment-14726815 ] Kousuke Saruta commented on SPARK-8694: --- Now this issue is addressed by the pagination. > Defer executing drawTaskAssignmentTimeline until page loaded to avoid to > freeze the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive tasks in the stage page (such as, running > sc.parallelize(1 to 10, 1).count()), Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until page loaded to avoid freezing the page. So > that the user can view the page while rendering Event Timeline in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking loading page.
[jira] [Updated] (SPARK-8402) DP means clustering
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-8402: - Description: At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify a priori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono. A benchmark comparison between k-means and DP-means, based on Normalized Mutual Information (NMI) between ground truth clusters and algorithm outputs, is provided in the following table. It can be seen from the table that DP-means reported a higher NMI on 5 of 8 data sets in comparison to k-means [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics (2011) Arxiv:.0352. (Table 1)]

| Dataset | DP-means | k-means |
| Wine | .41 | .43 |
| Iris | .75 | .76 |
| Pima | .02 | .03 |
| Soybean | .72 | .66 |
| Car | .07 | .05 |
| Balance Scale | .17 | .11 |
| Breast Cancer | .04 | .03 |
| Vehicle | .18 | .18 |

Experiment on our Spark cluster setup: An initial benchmark study was performed on a 3-node Spark cluster set up on Mesos, where each node had 8 cores and 64 GB RAM, and the Spark version used was 1.5 (git branch). Tests were done using a mixture of 10 Gaussians with varying numbers of features and instances. The results from the benchmark study are provided below. The reported stats are an average over 5 runs. 
|| Instances || Dimensions || Clusters obtained (DP-means) || DP-means time || DP-means iterations to converge || k-means (k=10) time || k-means iterations to converge ||
| 10 million | 10 | 10 | 43.6s | 2 | 52.2s | 2 |
| 1 million | 100 | 10 | 39.8s | 2 | 43.39s | 2 |
| 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |

was: At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify apriori the number of clusters. DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan]. We have followed the distributed implementation of DP means which has been proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono. > DP means clustering > > > Key: SPARK-8402 > URL: https://issues.apache.org/jira/browse/SPARK-8402 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Labels: features > > At present, all the clustering algorithms in MLlib require the number of > clusters to be specified in advance. > The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model > that allows for flexible clustering of data without having to specify apriori > the number of clusters. > DP means is a non-parametric clustering algorithm that uses a scale parameter > 'lambda' to control the creation of new clusters ["Revisiting k-means: New > Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan]. > We have followed the distributed implementation of DP means which has been > proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" > by Xinghao Pan, Evan R. 
Sparks, Andre Wibisono. > A benchmark comparison between k-means and dp-means based on Normalized > Mutual Information between ground truth clusters and algorithm outputs, have > been provided in the following table. It can be seen from the table that > DP-means reported a higher NMI on 5 of 8 data sets in comparison to > k-means[Source: Kulis, B., Jordan,
[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726814#comment-14726814 ] Saisai Shao commented on SPARK-10288: - Hi [~vanzin], thanks a lot for your comments. Yes, it doesn't make sense to compare Yarn with Standalone and Mesos, and the protocol is also different compared to the other two cluster managers. I will update the description. But to some extent I think a rest client is still meaningful, as [~ste...@apache.org] mentioned. If you have any suggestions please let me know, thanks a lot. > Add a rest client for Spark on Yarn > --- > > Key: SPARK-10288 > URL: https://issues.apache.org/jira/browse/SPARK-10288 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > This is a proposal to add rest client for Spark on Yarn. Currently Spark > standalone and Mesos mode can support rest way of submitting applications, > for Spark on Yarn, it still uses program way to do it. Since RM now (from > 2.6) supports rest way of submitting application, so it would be better Spark > on Yarn also support this way. > Here is the design doc > (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). > Currently I'm working on it, working branch is > (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major > part is already finished. > Any comment is greatly appreciated, thanks a lot.
[jira] [Commented] (SPARK-8469) Application timeline view unreadable with many executors
[ https://issues.apache.org/jira/browse/SPARK-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726812#comment-14726812 ] Kousuke Saruta commented on SPARK-8469: --- Thanks for investigating the use case of the timeline view with dynamic allocation. I understand that showing the last N is not meaningful. Unfortunately, I don't have enough time to consider a better solution until the end of October; I'll try to address this issue after that. > Application timeline view unreadable with many executors > > > Key: SPARK-8469 > URL: https://issues.apache.org/jira/browse/SPARK-8469 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Kousuke Saruta > Attachments: Screen Shot 2015-06-18 at 5.51.21 PM.png > > > This is a problem with using dynamic allocation with many executors. See > screenshot. We may want to limit the number of stacked events somehow.
[jira] [Commented] (SPARK-9717) Add persistence to MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726803#comment-14726803 ] holdenk commented on SPARK-9717: I was looking at this, but it seems like it doesn't make as much sense since there isn't an internal RDD. > Add persistence to MulticlassMetrics > > > Key: SPARK-9717 > URL: https://issues.apache.org/jira/browse/SPARK-9717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Add RDD persistence to MulticlassMetrics internals, following the example of > BinaryClassificationMetrics.
[jira] [Commented] (SPARK-3871) compute-classpath.sh does not escape :
[ https://issues.apache.org/jira/browse/SPARK-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726785#comment-14726785 ] Iulian Dragos commented on SPARK-3871: -- There's no more compute-classpath.sh. Ok to close this? > compute-classpath.sh does not escape : > -- > > Key: SPARK-3871 > URL: https://issues.apache.org/jira/browse/SPARK-3871 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.1.0 >Reporter: Hector Yee >Priority: Minor > > Chronos jobs on Mesos schedule jobs in temp directories such as > /tmp/mesos/slaves/20140926-142803-3852091146-5050-3487-375/frameworks/20140719-203536-160311562-5050-10655-0007/executors/ct:1412815902180:2:search_ranking_scoring/runs/f1e0d058-3ef0-4838-816e-e3fa5e179dd8 > The compute-classpath.sh does not properly escape the : in the temp dirs > generated by mesos and so the spark-submit gets a broken classpath
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726784#comment-14726784 ] Iulian Dragos commented on SPARK-4940: -- Would it make sense to allocate resources in a round-robin fashion? Supposing Spark gets several offers at the same time, it would have enough info to balance executors across the available resources (or, optionally, define an interval during which it holds on to the resources it receives, to accumulate a larger set of slaves). The algorithm may proceed by allocating a multiple of `spark.task.cores` (below the cap; see SPARK-9873, which might help on its own) on each slave in the set of resources, until it can't allocate any more. > Support more evenly distributing cores for Mesos mode > - > > Key: SPARK-4940 > URL: https://issues.apache.org/jira/browse/SPARK-4940 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > Attachments: mesos-config-difference-3nodes-vs-2nodes.png > > > Currently in coarse-grained mode the Spark scheduler simply takes all the > resources it can on each node, which can cause uneven distribution based on > the resources available on each slave.
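The round-robin idea in the comment above can be sketched in a few lines. This is a hypothetical illustration only: `roundRobinAllocate`, its signature, and the offer representation are invented for the example, not Spark's actual Mesos scheduler API. Given a batch of offers (slave id to free cores), it grants `taskCores` on each slave in turn until a global cap is hit or no offer can fit another grant.

```scala
// Hypothetical sketch of round-robin core allocation over a batch of
// Mesos offers. Each pass over the slaves grants `taskCores` to every
// slave that still has room, so load spreads evenly instead of the
// first slave being drained.
def roundRobinAllocate(offers: Map[String, Int],
                       taskCores: Int,
                       maxCores: Int): Map[String, Int] = {
  val free = scala.collection.mutable.Map(offers.toSeq: _*)
  val granted = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  var total = 0
  var progressed = true
  while (progressed && total + taskCores <= maxCores) {
    progressed = false
    for (slave <- offers.keys.toSeq.sorted) {
      if (total + taskCores <= maxCores && free(slave) >= taskCores) {
        free(slave) -= taskCores    // consume part of this slave's offer
        granted(slave) += taskCores // record the grant
        total += taskCores
        progressed = true
      }
    }
  }
  granted.toMap
}

// Three identical 8-core offers with a 12-core cap end up evenly loaded:
val grants = roundRobinAllocate(Map("s1" -> 8, "s2" -> 8, "s3" -> 8), 2, 12)
assert(grants == Map("s1" -> 4, "s2" -> 4, "s3" -> 4))
```

A greedy allocator would instead grant all 8 cores on s1 first, which is exactly the uneven distribution the issue describes.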
[jira] [Assigned] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8514: --- Assignee: (was: Apache Spark) > LU factorization on BlockMatrix > --- > > Key: SPARK-8514 > URL: https://issues.apache.org/jira/browse/SPARK-8514 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > Labels: advanced > Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, > BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix > Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala > > > LU is the most common method to solve a general linear system or invert a > general matrix. A distributed version could be implemented block-wise with > pipelining. A reference implementation is provided in ScaLAPACK: > http://netlib.org/scalapack/slug/node178.html
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726777#comment-14726777 ] Apache Spark commented on SPARK-8514: - User 'nilmeier' has created a pull request for this issue: https://github.com/apache/spark/pull/8563 > LU factorization on BlockMatrix > --- > > Key: SPARK-8514 > URL: https://issues.apache.org/jira/browse/SPARK-8514 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > Labels: advanced > Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, > BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix > Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala > > > LU is the most common method to solve a general linear system or invert a > general matrix. A distributed version could be implemented block-wise with > pipelining. A reference implementation is provided in ScaLAPACK: > http://netlib.org/scalapack/slug/node178.html
[jira] [Assigned] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8514: --- Assignee: Apache Spark > LU factorization on BlockMatrix > --- > > Key: SPARK-8514 > URL: https://issues.apache.org/jira/browse/SPARK-8514 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Apache Spark > Labels: advanced > Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, > BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix > Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala > > > LU is the most common method to solve a general linear system or invert a > general matrix. A distributed version could be implemented block-wise with > pipelining. A reference implementation is provided in ScaLAPACK: > http://netlib.org/scalapack/slug/node178.html
[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
[ https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726775#comment-14726775 ] Iulian Dragos commented on SPARK-7874: -- [~tomdz] do you mean respecting `spark.cores.max`, as it is the case in coarse-grained mode? > Add a global setting for the fine-grained mesos scheduler that limits the > number of concurrent tasks of a job > - > > Key: SPARK-7874 > URL: https://issues.apache.org/jira/browse/SPARK-7874 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Thomas Dudziak >Priority: Minor > > This would be a very simple yet effective way to prevent a job dominating the > cluster. A way to override it per job would also be nice but not required.
[jira] [Updated] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerome updated SPARK-8514: -- Attachment: Matrix Factorization - M...ark 1.5.0 Documentation.pdf I added a version of the documentation that contains some of the design documentation for the LU algorithm. Some of the descriptions may not be necessary for Spark users, but could be useful for reviewers. Cheers, Jerome > LU factorization on BlockMatrix > --- > > Key: SPARK-8514 > URL: https://issues.apache.org/jira/browse/SPARK-8514 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > Labels: advanced > Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, > BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix > Factorization - M...ark 1.5.0 Documentation.pdf, testScript.scala > > > LU is the most common method to solve a general linear system or invert a > general matrix. A distributed version could be implemented block-wise with > pipelining. A reference implementation is provided in ScaLAPACK: > http://netlib.org/scalapack/slug/node178.html
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when you start working on a feature. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned.
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
* Remember to add the `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add the "starter" label to starter tasks.
* Put a rough estimate on medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-2352)
** autoencoder (SPARK-4288)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a complete list [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall into two major categories:

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPAR
[jira] [Created] (SPARK-10413) Model should support prediction on single instance
Xiangrui Meng created SPARK-10413: - Summary: Model should support prediction on single instance Key: SPARK-10413 URL: https://issues.apache.org/jira/browse/SPARK-10413 Project: Spark Issue Type: Umbrella Components: ML Reporter: Xiangrui Meng Priority: Critical Currently, models in the pipeline API only implement transform(DataFrame). It would be quite useful to support prediction on a single instance.
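The requested API shape can be illustrated with a toy model. This is a hedged sketch only: `ToyLinearModel` is a hypothetical stand-in, not a pipeline model, and the method names are assumptions for the example. The point is that a `predict(features)` entry scores one instance directly, without wrapping a single row in a DataFrame.

```scala
// Toy sketch of single-instance prediction alongside a batch path.
// ToyLinearModel is hypothetical; a real pipeline model would expose a
// similar predict(features) next to transform(DataFrame).
class ToyLinearModel(weights: Array[Double], intercept: Double) {
  // Single-instance prediction: a plain dot product plus intercept,
  // with no per-call DataFrame overhead.
  def predict(features: Array[Double]): Double =
    features.zip(weights).map { case (x, w) => x * w }.sum + intercept

  // The batch path can simply reuse the single-instance path.
  def transform(rows: Seq[Array[Double]]): Seq[Double] = rows.map(predict)
}

val model = new ToyLinearModel(Array(1.0, 2.0), 0.5)
assert(model.predict(Array(3.0, 4.0)) == 11.5)
assert(model.transform(Seq(Array(3.0, 4.0), Array(0.0, 0.0))) == Seq(11.5, 0.5))
```

Defining the batch path in terms of the single-instance path also keeps the two consistent, which is one motivation for the umbrella issue.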
[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration
[ https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726751#comment-14726751 ] holdenk commented on SPARK-9595: I can do this if no one else is working on it. > Adding API to SparkConf for kryo serializers registration > - > > Key: SPARK-9595 > URL: https://issues.apache.org/jira/browse/SPARK-9595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1 >Reporter: John Chen >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > Currently, SparkConf has a registerKryoClasses API for Kryo registration. > However, this only works when you register classes. If you want to register > customized Kryo serializers, you'll have to extend the KryoSerializer class > and write some code. > This is not only inconvenient, but also requires the registration to be > done at compile time, which is not always possible. Thus, I suggest adding > another API to SparkConf for registering customized Kryo serializers. It > could look like this: > def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
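A self-contained sketch of the proposed API shape follows. The MyConf class and the minimal Serializer trait here are stand-ins so the example runs on its own; the actual proposal would take com.esotericsoftware.kryo.Serializer instances on SparkConf:

```scala
// Stand-in for Kryo's Serializer, just enough to show the registration shape.
trait Serializer[T] { def write(value: T): String }

class MyConf {
  private var serializers = Map.empty[Class[_], Serializer[_]]

  // Proposed addition: register custom serializers at runtime, returning
  // the conf for chaining, like SparkConf.registerKryoClasses does.
  def registerKryoSerializers(regs: Map[Class[_], Serializer[_]]): MyConf = {
    serializers ++= regs
    this
  }

  def serializerFor(cls: Class[_]): Option[Serializer[_]] = serializers.get(cls)
}

case class Point(x: Int, y: Int)
object PointSerializer extends Serializer[Point] {
  def write(p: Point): String = s"${p.x},${p.y}"
}

val conf = new MyConf()
  .registerKryoSerializers(Map(classOf[Point] -> PointSerializer))
println(conf.serializerFor(classOf[Point]).isDefined)  // true
```

Because registration happens at runtime via a value, serializers can be chosen dynamically (e.g. from configuration), which is exactly what the compile-time subclassing approach prevents.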
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726736#comment-14726736 ] Nick Xie commented on SPARK-3655: - I did exactly that. Since I will always provide a comparator, I also took the liberty of removing a few overloaded constructors. Less is more when it comes to code maintenance. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726664#comment-14726664 ] Koert Kuipers commented on SPARK-3655: -- Did you build a version that does not use Optional for java api? > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726658#comment-14726658 ] Nick Xie commented on SPARK-3655: - Thanks for the quick changes to get rid of the Ordering dependency. Since I am only using it in a specific way, through a few small hacks I was able to get rid of the entire runtime dependency on Guava. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
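The semantics being requested — a sorted iterator of values per key — can be sketched locally with a composite (key, value) sort followed by grouping. In Spark, sorting by a composite key during the shuffle (e.g. with repartitionAndSortWithinPartitions, available since 1.2) achieves the same thing at scale without materializing each group:

```scala
// Local sketch of secondary-sort semantics: sort records by the composite
// (key, value) ordering, then group; each key's values come out sorted.
val records = Seq(("b", 2), ("a", 3), ("b", 1), ("a", 1), ("a", 2))

val secondarySorted: Map[String, Seq[Int]] =
  records
    .sortBy(identity)  // composite ordering: key first, then value
    .groupBy(_._1)     // groupBy preserves encounter order within each key
    .map { case (k, kvs) => k -> kvs.map(_._2) }

println(secondarySorted("a"))  // List(1, 2, 3)
```

The comparator discussion above maps onto the Ordering used for the composite sort: supplying your own comparator means you control how values are ordered within each key.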
[jira] [Updated] (SPARK-10410) spark 1.4.1 kill command does not work with streaming job.
[ https://issues.apache.org/jira/browse/SPARK-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryce Ageno updated SPARK-10410: Shepherd: (was: Bryce Ageno) > spark 1.4.1 kill command does not work with streaming job. > -- > > Key: SPARK-10410 > URL: https://issues.apache.org/jira/browse/SPARK-10410 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.4.1 >Reporter: Bryce Ageno > > Our team recently upgraded a cluster from 1.3.1 to 1.4.1, and we discovered > that when you run the kill command for a driver (/usr/spark/bin/spark-submit > --master spark://$SPARK_MASTER_IP:6066 --kill $SPARK_DRIVER) it does not > remove the driver from the Spark UI. It is a streaming job, and the kill > command "ends" the job but does not free up the resources or remove it > from the Spark master. > We are running in cluster mode. We have also noticed that with 1.4.1, when > submitting multiple jobs via spark-submit, all of the drivers end up on a > single worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10034) add regression test for Sort on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10034: Description: Before #8371, there was a bug in `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, and couldn't use more than one ordering expression containing aggregate functions. The reason for this bug is that the aggregate expression in `SortOrder` never gets resolved; we alias it with `_aggOrdering` and call `toAttribute`, which gives us an `UnresolvedAttribute`. So we are actually referencing the aggregate expression by name, not by exprId as we thought. And if there is already an aggregate expression named `_aggOrdering`, or there is more than one ordering expression containing aggregate functions, we will have conflicting names and can't search by name. However, after #8371 got merged, the `SortOrder`s are guaranteed to be resolved and we always reference the aggregate expression by exprId. The bug doesn't exist anymore, and this PR adds regression tests for it. was: {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} > add regression test for Sort on Aggregate > - > > Key: SPARK-10034 > URL: https://issues.apache.org/jira/browse/SPARK-10034 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > Before #8371, there was a bug in `Sort` on `Aggregate`: we couldn't use > aggregate expressions named `_aggOrdering`, and couldn't use more than one > ordering expression containing aggregate functions. The reason for this bug > is that the aggregate expression in `SortOrder` never gets resolved; we > alias it with `_aggOrdering` and call `toAttribute`, which gives us an > `UnresolvedAttribute`. So we are actually referencing the aggregate expression by > name, not by exprId as we thought.
And if there is already an aggregate > expression named `_aggOrdering`, or there is more than one ordering > expression containing aggregate functions, we will have conflicting names and can't > search by name. > However, after #8371 got merged, the `SortOrder`s are guaranteed to be > resolved and we always reference the aggregate expression by exprId. The > bug doesn't exist anymore, and this PR adds regression tests for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10034) add regression test for Sort on Aggregate
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10034: Summary: add regression test for Sort on Aggregate (was: add regression test for sort on ) > add regression test for Sort on Aggregate > - > > Key: SPARK-10034 > URL: https://issues.apache.org/jira/browse/SPARK-10034 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > {code} > val df = Seq(1 -> 2).toDF("i", "j") > val query = df.groupBy('i) > .agg(max('j).as("_aggOrdering")) > .orderBy(sum('j)) > checkAnswer(query, Row(1, 2)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10034) add regression test for sort on
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10034: Summary: add regression test for sort on (was: Can't analyze Sort on Aggregate with aggregation expression named "_aggOrdering") > add regression test for sort on > > > Key: SPARK-10034 > URL: https://issues.apache.org/jira/browse/SPARK-10034 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > {code} > val df = Seq(1 -> 2).toDF("i", "j") > val query = df.groupBy('i) > .agg(max('j).as("_aggOrdering")) > .orderBy(sum('j)) > checkAnswer(query, Row(1, 2)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10412) In SQL tab, show execution memory per physical operator
Andrew Or created SPARK-10412: - Summary: In SQL tab, show execution memory per physical operator Key: SPARK-10412 URL: https://issues.apache.org/jira/browse/SPARK-10412 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 1.5.0 Reporter: Andrew Or We already display it per task / stage. It's really useful to also display it per operator so the user can know which one caused all the memory to be allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10411) In SQL tab move visualization above explain output
Andrew Or created SPARK-10411: - Summary: In SQL tab move visualization above explain output Key: SPARK-10411 URL: https://issues.apache.org/jira/browse/SPARK-10411 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Shixiong Zhu Request from [~pwendell]: (1) The visualization is much more interesting than the DF explain output. That should be at the top of the page. (2) The DF explain output is for advanced users and should be collapsed by default These are just minor UX optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10314) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726557#comment-14726557 ] Xiaoyu Wang commented on SPARK-10314: - I resubmit the pull request on the master branch > [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception > when parallelism is big than data split size > > > Key: SPARK-10314 > URL: https://issues.apache.org/jira/browse/SPARK-10314 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.4.1 > Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4 >Reporter: Xiaoyu Wang >Priority: Minor > > RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when > parallelism is big than data split size > {code} > val rdd = sc.parallelize(List(1, 2),2) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > is ok. > {code} > val rdd = sc.parallelize(List(1, 2),3) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > got exceptoin: > {noformat} > 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24 > 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 > output partitions (allowLocal=false) > 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at > :24) > 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() > 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() > 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 > (ParallelCollectionRDD[0] at parallelize at :21), which has no > missing parents > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with > curMem=0, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 1096.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with > curMem=1096, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block 
broadcast_0_piece0 stored as bytes > in memory (estimated size 788.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43776 (size: 788.0 B, free: 706.9 MB) > 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:874 > 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from > ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21) > 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1269 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) > 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it > 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started > 15/08/27 17:53:08 WARN : tachyon.home is not set. Using > /mnt/tachyon_default_home as the default value. 
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect > master @ localhost/127.0.0.1:19998 > 15/08/27 17:53:08 INFO : User registered at the master > localhost/127.0.0.1:19998 got UserId 109 > 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at > /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 > 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost > 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 > 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was > created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 > was created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 > was created! > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore > on localhost:43776 (size: 0.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManager:
[jira] [Comment Edited] (SPARK-10314) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726557#comment-14726557 ] Xiaoyu Wang edited comment on SPARK-10314 at 9/2/15 1:33 AM: - I resubmit the pull request on the master branch https://github.com/apache/spark/pull/8562 was (Author: wangxiaoyu): I resubmit the pull request on the master branch > [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception > when parallelism is big than data split size > > > Key: SPARK-10314 > URL: https://issues.apache.org/jira/browse/SPARK-10314 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.4.1 > Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4 >Reporter: Xiaoyu Wang >Priority: Minor > > RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when > parallelism is big than data split size > {code} > val rdd = sc.parallelize(List(1, 2),2) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > is ok. 
> {code} > val rdd = sc.parallelize(List(1, 2),3) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > got exceptoin: > {noformat} > 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24 > 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 > output partitions (allowLocal=false) > 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at > :24) > 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() > 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() > 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 > (ParallelCollectionRDD[0] at parallelize at :21), which has no > missing parents > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with > curMem=0, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 1096.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with > curMem=1096, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 788.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43776 (size: 788.0 B, free: 706.9 MB) > 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:874 > 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from > ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21) > 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1269 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO 
Executor: Running task 2.0 in stage 0.0 (TID 2) > 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it > 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started > 15/08/27 17:53:08 WARN : tachyon.home is not set. Using > /mnt/tachyon_default_home as the default value. > 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect > master @ localhost/127.0.0.1:19998 > 15/08/27 17:53:08 INFO : User registered at the master > localhost/127.0.0.1:19998 got UserId 109 > 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at > /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 > 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost > 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 > 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was > created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 > was created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 > was created! > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore > on localhost:43776 (size: 0.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore > on localhost:43776 (s
[jira] [Updated] (SPARK-4122) Add library to write data back to Kafka
[ https://issues.apache.org/jira/browse/SPARK-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4122: - Target Version/s: 1.6.0 > Add library to write data back to Kafka > --- > > Key: SPARK-4122 > URL: https://issues.apache.org/jira/browse/SPARK-4122 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10314) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726551#comment-14726551 ] Apache Spark commented on SPARK-10314: -- User 'romansew' has created a pull request for this issue: https://github.com/apache/spark/pull/8562 > [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception > when parallelism is big than data split size > > > Key: SPARK-10314 > URL: https://issues.apache.org/jira/browse/SPARK-10314 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.4.1 > Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4 >Reporter: Xiaoyu Wang >Priority: Minor > > RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when > parallelism is big than data split size > {code} > val rdd = sc.parallelize(List(1, 2),2) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > is ok. > {code} > val rdd = sc.parallelize(List(1, 2),3) > rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) > rdd.count() > {code} > got exceptoin: > {noformat} > 15/08/27 17:53:07 INFO SparkContext: Starting job: count at :24 > 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at :24) with 3 > output partitions (allowLocal=false) > 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at > :24) > 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() > 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() > 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 > (ParallelCollectionRDD[0] at parallelize at :21), which has no > missing parents > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with > curMem=0, maxMem=741196431 > 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 1096.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with > curMem=1096, maxMem=741196431 > 
15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 788.0 B, free 706.9 MB) > 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43776 (size: 788.0 B, free: 706.9 MB) > 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:874 > 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from > ResultStage 0 (ParallelCollectionRDD[0] at parallelize at :21) > 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1269 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, > localhost, PROCESS_LOCAL, 1270 bytes) > 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) > 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it > 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it > 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started > 15/08/27 17:53:08 WARN : tachyon.home is not set. Using > /mnt/tachyon_default_home as the default value. 
> 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect > master @ localhost/127.0.0.1:19998 > 15/08/27 17:53:08 INFO : User registered at the master > localhost/127.0.0.1:19998 got UserId 109 > 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at > /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 > 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost > 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 > 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was > created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 > was created! > 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 > was created! > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore > on localhost:43776 (size: 0.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore > on localhost:43776 (size: 2.0 B) > 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore > on localhost:43776
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3586: - Target Version/s: 1.6.0 > Support nested directories in Spark Streaming > - > > Key: SPARK-3586 > URL: https://issues.apache.org/jira/browse/SPARK-3586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Minor > > For text files, there is the method streamingContext.textFileStream(dataDirectory). > Spark Streaming will monitor the directory dataDirectory and process any > files created in that directory, but files written in nested directories are not > supported. > e.g. > streamingContext.textFileStream(/test). > Look at the directory contents: > /test/file1 > /test/file2 > /test/dr/file1 > With this method, "textFileStream" can only read: > /test/file1 > /test/file2 > /test/dr/ > but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3586: - Labels: (was: patch) > Support nested directories in Spark Streaming > - > > Key: SPARK-3586 > URL: https://issues.apache.org/jira/browse/SPARK-3586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Minor > > For text files, there is the method streamingContext.textFileStream(dataDirectory). > Spark Streaming will monitor the directory dataDirectory and process any > files created in that directory, but files written in nested directories are not > supported. > e.g. > streamingContext.textFileStream(/test). > Look at the directory contents: > /test/file1 > /test/file2 > /test/dr/file1 > With this method, "textFileStream" can only read: > /test/file1 > /test/file2 > /test/dr/ > but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
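The limitation above comes down to the monitoring code scanning only one directory level. A minimal plain-Scala sketch of the recursive scan that nested-directory support would need (illustrative only, not Spark's actual FileInputDStream logic):

```scala
import java.io.File

// Collect regular files under dir at any depth; a nested-directory-aware
// file stream would run a scan like this on each monitoring interval.
def listFilesRecursively(dir: File): Seq[File] = {
  val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
  entries.filter(_.isFile) ++
    entries.filter(_.isDirectory).flatMap(listFilesRecursively)
}
```

Applied to the example above, the scan would return /test/file1, /test/file2, and /test/dr/file1, rather than stopping at /test/dr/.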
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
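Requirement (1), a basic autoencoder over real-valued inputs, can be illustrated with a toy forward pass and its squared reconstruction error. The weights below are hand-picked for the sketch; a real implementation would fit them by backpropagation, as in the linked branch:

```scala
// Toy single-hidden-layer autoencoder: sigmoid hidden units, linear output.
def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
  m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

def reconstruct(
    encoder: Array[Array[Double]],  // hidden x input weights
    decoder: Array[Array[Double]],  // input x hidden weights
    input: Array[Double]): Array[Double] = {
  val hidden = matVec(encoder, input).map(sigmoid)  // encode: real -> [0..1]
  matVec(decoder, hidden)                           // decode: linear output
}

def squaredError(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val enc = Array(Array(1.0, -1.0))        // 1 hidden unit, 2 inputs
val dec = Array(Array(0.5), Array(0.5))  // 2 outputs, 1 hidden unit
val x = Array(1.0, 0.0)
val err = squaredError(x, reconstruct(enc, dec, x))
```

Training minimizes this reconstruction error over the data; the different input types in requirement (1) correspond to swapping the output activation and the loss (e.g. cross-entropy for binary inputs).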
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers: References: 1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf was: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. 
http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10410) spark 1.4.1 kill command does not work with streaming job.
Bryce Ageno created SPARK-10410: --- Summary: spark 1.4.1 kill command does not work with streaming job. Key: SPARK-10410 URL: https://issues.apache.org/jira/browse/SPARK-10410 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.1 Reporter: Bryce Ageno Our team recently upgraded a cluster to 1.4.1 from 1.3.1 and we discovered that when you run the kill command for a driver (/usr/spark/bin/spark-submit --master spark://$SPARK_MASTER_IP:6066 --kill $SPARK_DRIVER) it does not remove the driver from the Spark UI. It is a streaming job, and the kill command "ends" the job but does not free up the resources or remove it from the Spark master. We are running in cluster mode. We have also noticed that with 1.4.1, when there are multiple spark-submits, all of the drivers end up on a single worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10409) Multilayer perceptron regression
Alexander Ulanov created SPARK-10409: Summary: Multilayer perceptron regression Key: SPARK-10409 URL: https://issues.apache.org/jira/browse/SPARK-10409 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Implement regression based on multilayer perceptron (MLP). It should support different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. The implementation might take advantage of autoencoder. Time-series forecasting for financial data might be one of the use cases, see http://dl.acm.org/citation.cfm?id=561452. So there is the need for more specific requirements from this (or other) area. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10409) Multilayer perceptron regression
[ https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726435#comment-14726435 ] Alexander Ulanov commented on SPARK-10409: -- Basic implementation with the current ML api can be found here: https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala > Multilayer perceptron regression > > > Key: SPARK-10409 > URL: https://issues.apache.org/jira/browse/SPARK-10409 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Implement regression based on multilayer perceptron (MLP). It should support > different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. > The implementation might take advantage of autoencoder. Time-series > forecasting for financial data might be one of the use cases, see > http://dl.acm.org/citation.cfm?id=561452. So there is the need for more > specific requirements from this (or other) area. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov commented on SPARK-10408: -- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Issue Type: Umbrella (was: Improvement) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10408) Autoencoder
Alexander Ulanov created SPARK-10408: Summary: Autoencoder Key: SPARK-10408 URL: https://issues.apache.org/jira/browse/SPARK-10408 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
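Requirement (1)'s core idea, training a network to reproduce its own input, can be sketched minimally. The following is a pure-Python toy (a linear autoencoder with one hidden unit trained by SGD), not the MLP-based Spark implementation this umbrella proposes; the hyperparameters and function names are illustrative assumptions:

```python
import random

def train_linear_autoencoder(data, hidden=1, lr=0.01, epochs=200, seed=0):
    """SGD on a tiny linear autoencoder: x -> h = W1 x -> xhat = W2 h.
    Returns (mse_before_training, mse_after_training)."""
    rng = random.Random(seed)
    d = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(d)]

    def forward(x):
        h = [sum(W1[j][i] * x[i] for i in range(d)) for j in range(hidden)]
        xhat = [sum(W2[i][j] * h[j] for j in range(hidden)) for i in range(d)]
        return h, xhat

    def mse():
        return sum(sum((xh - xi) ** 2 for xi, xh in zip(x, forward(x)[1]))
                   for x in data) / len(data)

    before = mse()
    for _ in range(epochs):
        for x in data:
            h, xhat = forward(x)
            err = [xhat[i] - x[i] for i in range(d)]   # d(loss)/d(xhat)
            # Compute both gradients before applying either update.
            gW2 = [[2 * err[i] * h[j] for j in range(hidden)] for i in range(d)]
            gH = [sum(2 * err[i] * W2[i][j] for i in range(d)) for j in range(hidden)]
            gW1 = [[gH[j] * x[i] for i in range(d)] for j in range(hidden)]
            for i in range(d):
                for j in range(hidden):
                    W2[i][j] -= lr * gW2[i][j]
            for j in range(hidden):
                for i in range(d):
                    W1[j][i] -= lr * gW1[j][i]
    return before, mse()
```

The requirements above go further (sigmoid/bounded outputs for binary and [0..1] inputs, sparsity and denoising variants, stacking for pre-training), but the reconstruction objective shown here is the piece all of them share.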
[jira] [Commented] (SPARK-10387) Code generation for decision tree
[ https://issues.apache.org/jira/browse/SPARK-10387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726392#comment-14726392 ] DB Tsai commented on SPARK-10387: - Here are our current research results. We implemented a prototype of code generation for trees; the implementation is here: https://github.com/dbtsai/tree/blob/master/macros/src/main/scala/Tree.scala 1) We found that code-gen is 4x to 6x faster than a naive binary tree when the number of trees used in GBDT is small, but with around 500 trees the performance is slightly worse. 2) We're also benchmarking the flattened-trees idea described here: http://tullo.ch/articles/decision-tree-evaluation/ 3) Finally, QuickScorer (A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees, http://delivery.acm.org/10.1145/277/2767733/p73-lucchese.pdf) is being implemented; we will benchmark it as well. > Code generation for decision tree > - > > Key: SPARK-10387 > URL: https://issues.apache.org/jira/browse/SPARK-10387 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: DB Tsai > > Provide code generation for decision tree and tree ensembles. Let's first > discuss the design and then create new JIRAs for tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
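The flattened-trees idea benchmarked in (2) replaces pointer-chasing node objects with parallel arrays indexed by node id, which is more cache-friendly. Below is a minimal pure-Python sketch of the two representations (not the linked Scala prototype; the node encoding, with feature == -1 marking a leaf, is an illustrative assumption):

```python
class Node:
    """Naive pointer-based tree node; feature == -1 marks a leaf."""
    def __init__(self, feature=-1, threshold=0.0, left=None, right=None, value=0.0):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def predict_naive(node, x):
    # Walk the tree by following object references (pointer chasing).
    while node.feature >= 0:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

def predict_flat(feature, threshold, left, right, value, x):
    # Same tree stored as parallel arrays: children are indices, not pointers.
    i = 0
    while feature[i] >= 0:
        i = left[i] if x[feature[i]] <= threshold[i] else right[i]
    return value[i]

# One small tree encoded both ways.
# Indices: 0 = root split, 1 = left leaf, 2 = right split, 3/4 = its leaves.
root = Node(0, 0.5,
            Node(value=1.0),
            Node(1, 2.0, Node(value=2.0), Node(value=3.0)))
feature   = [0, -1, 1, -1, -1]
threshold = [0.5, 0.0, 2.0, 0.0, 0.0]
left      = [1, -1, 3, -1, -1]
right     = [2, -1, 4, -1, -1]
value     = [0.0, 1.0, 0.0, 2.0, 3.0]
```

Both evaluators return identical predictions; only the memory layout differs, which is where the speedup in the linked article comes from.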
[jira] [Commented] (SPARK-8534) Gini for regression metrics and evaluator
[ https://issues.apache.org/jira/browse/SPARK-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726364#comment-14726364 ] Ehsan Mohyedin Kermani commented on SPARK-8534: --- I'd like to give it a shot, but first I think we need a distributed scan function for computing the cumulative sum of the sorted predictions. Would it be possible to add that to RegressionMetrics or perhaps mllib.util first? An implementation was suggested here https://groups.google.com/forum/#!topic/spark-users/ts-FdB50ltY. > Gini for regression metrics and evaluator > - > > Key: SPARK-8534 > URL: https://issues.apache.org/jira/browse/SPARK-8534 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > One common metric we do not have in RegressionMetrics or RegressionEvaluator > is Gini: [https://www.kaggle.com/wiki/Gini] > Implementing (normalized) Gini would be nice. However, it might be > expensive; I believe it would require sorting the labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
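To make the sorting/cumulative-sum requirement concrete, here is a single-machine pure-Python sketch of the (normalized) Gini metric in the Kaggle sense referenced by the issue; it is not Spark code, and a distributed version would indeed need the scan mentioned above. It assumes a nonzero total label mass:

```python
def gini(actual, pred):
    """Gini score of a ranking: sort actual by predicted score (descending,
    ties broken by index) and accumulate the cumulative share of the total."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    ranked = [actual[i] for i in order]
    total = float(sum(ranked))       # assumed nonzero for this sketch
    cum = 0.0
    score = 0.0
    for a in ranked:
        cum += a
        score += cum / total
    score -= (n + 1) / 2.0
    return score / n

def normalized_gini(actual, pred):
    # Normalize by the score of a perfect ranking, so 1.0 means "perfect".
    return gini(actual, pred) / gini(actual, actual)
```

The sort over labels is exactly the expensive step the issue description anticipates; everything after it is a linear scan.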
[jira] [Comment Edited] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331 ] ashish shenoy edited comment on SPARK-10405 at 9/1/15 10:39 PM: [~srowen] yes, technically its a good to have not a must have. I can think of many instances where such an API would be very convenient and useful for users to use. I was using the aggregateByKey() with a custom written bounded priority queue. As per the spark documentation, the func param to foldByKey() should be an associative merge function. So I can think of how this can be used to get the max or min value per key, but not the top or bottom values. Since I am a spark-newbie, can you pls give an example of how one could use a priorityQueue with foldByKey() ? Also, the default PriorityQueue implementation in java.util is unbounded; could this cause OOM exceptions if the cardinality of the keyset is very large ? was (Author: ashishen...@gmail.com): [~srowen] yes, technically its a good to have not a must have. I could think of many instances where such an API would be very convenient and useful for users to have. I was using the aggregateByKey() with a custom written bounded priority queue. As per the spark documentation, the func param to foldByKey() should be an associative merge function. So I can think of how this can be used to get the max or min value per key, but not the top or bottom values. Since I am a spark-noob, can you pls give an example of how one could use a priorityQueue with foldByKey() ? Also, the default PriorityQueue implementation in java.util is unbounded; could this cause OOM exceptions if the cardinality of the keyset is very large ? 
> Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331 ] ashish shenoy edited comment on SPARK-10405 at 9/1/15 10:38 PM: [~srowen] yes, technically its a good to have not a must have. I could think of many instances where such an API would be very convenient and useful for users to have. I was using the aggregateByKey() with a custom written bounded priority queue. As per the spark documentation, the func param to foldByKey() should be an associative merge function. So I can think of how this can be used to get the max or min value per key, but not the top or bottom values. Since I am a spark-noob, can you pls give an example of how one could use a priorityQueue with foldByKey() ? Also, the default PriorityQueue implementation in java.util is unbounded; could this cause OOM exceptions if the cardinality of the keyset is very large ? was (Author: ashishen...@gmail.com): [~srowen] yes, technically its a good to have not a must have. I could think of many instances where such an API would be very convenient and useful for users to have. Thanks for that foldByKey() tip; I was using the aggregateByKey() with a custom written bounded priority queue. Since I am a spark-noob, can you pls give an example of how one could use a priorityQueue with foldByKey() ? Also, the default PriorityQueue implementation in java.util is unbounded; could this cause OOM exceptions if the cardinality of the keyset is very large ? > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. 
Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726339#comment-14726339 ] Kay Ousterhout commented on SPARK-2666: --- [~irashid] totally agree, and IIRC there's a TODO suggesting we kill all remaining running tasks once a stage becomes a zombie somewhere in the scheduler code. > when task is FetchFailed cancel running tasks of failedStage > > > Key: SPARK-2666 > URL: https://issues.apache.org/jira/browse/SPARK-2666 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Lianhui Wang > > in DAGScheduler's handleTaskCompletion,when reason of failed task is > FetchFailed, cancel running tasks of failedStage before add failedStage to > failedStages queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726331#comment-14726331 ] ashish shenoy commented on SPARK-10405: --- [~srowen] yes, technically it's a nice-to-have, not a must-have. I can think of many instances where such an API would be very convenient and useful for users. Thanks for that foldByKey() tip; I was using aggregateByKey() with a custom-written bounded priority queue. Since I am a Spark newbie, can you please give an example of how one could use a PriorityQueue with foldByKey()? Also, the default PriorityQueue implementation in java.util is unbounded; could this cause OOM exceptions if the cardinality of the keyset is very large? > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
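The bounded-priority-queue approach discussed in this thread can be sketched outside Spark. Below is a pure-Python illustration (heapq, not java.util.PriorityQueue or the actual Spark API) of the two functions an aggregateByKey/foldByKey-style top-K would need: a sequence op that folds one value into a size-bounded min-heap, and a combine op that merges two partition-local heaps. Because the heap never exceeds K entries per key, memory stays bounded regardless of how many values a key has:

```python
import heapq
from collections import defaultdict

K = 3  # number of top values to keep per key (illustrative)

def seq_op(heap, value):
    """Fold one value into a bounded min-heap holding the K largest seen."""
    if len(heap) < K:
        heapq.heappush(heap, value)
    elif value > heap[0]:          # heap[0] is the smallest of the current top K
        heapq.heapreplace(heap, value)
    return heap

def comb_op(h1, h2):
    """Merge two partition-local heaps; associative (up to ties), so it can
    play the role of the merge function in a fold/aggregate."""
    for v in h2:
        seq_op(h1, v)
    return h1

# Simulate the per-key fold over a small keyed dataset.
pairs = [("a", 5), ("b", 1), ("a", 9), ("a", 2), ("a", 7), ("b", 4)]
per_key = defaultdict(list)
for k, v in pairs:
    seq_op(per_key[k], v)

top = {k: sorted(h, reverse=True) for k, h in per_key.items()}
# top == {"a": [9, 7, 5], "b": [4, 1]}
```

On the OOM question raised above: the unbounded growth only happens if the queue itself is unbounded; the cap in seq_op keeps per-key state at K elements, so total memory is O(K × number of keys).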
[jira] [Updated] (SPARK-10407) Possible Stack-overflow using InheritableThreadLocal nested-properties for SparkContext.localProperties
[ https://issues.apache.org/jira/browse/SPARK-10407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-10407: --- Description: In my long-running web server that eventually uses a SparkContext, I eventually came across some stack overflow errors that could only be cleared by restarting my server. {code} java.lang.StackOverflowError: null at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) ~[na:1.7.0_45] ... ... at org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) ~[commons-lang3-3.3.jar:3.3] at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] ... {code} The bottom of the trace indicates that serializing a properties object is part of the stack when the overflow happens. I checked the origin of the properties, and it turns out it's coming from SparkContext.localProperties, an InheritableThreadLocal field. 
When I debugged further, I found that localProperties.childValue() wraps its parent properties object in another properties object, and returns the wrapper properties. The problem is that every time childValue was being called, I was seeing the properties that were passed in from the parent had a deeper and deeper nesting of wrapped properties. This doesn't make any sense since my application doesn't create threads recursively or anything like that, so I'm marking this issue as a minor one since it shouldn't affect the average application. On the other hand, there shouldn't really be any reason to be creating the properties in childValue using nesting. Instead, the properties returned by childValue should be flattened, and more importantly, a deep copy of the parent. I'm also concerned about the parent thread possibly modifying the wrapped properties object while it's being used by the child thread, creating possible race conditions since Properties is not thread-safe. was: In my long-running web server that eventually uses a SparkContext, I eventually came across some stack overflow errors that could only be cleared by restarting my server. {code} java.lang.StackOverflowError: null at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) ~[na:1.7.0_45] ... ... 
at org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) ~[commons-lang3-3.3.jar:3.3] at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] ... {code} The bottom of the trace indicates that serializing a properties object is part of the stack when the overflow happens. I checked the origin of the properties, and it turns
[jira] [Created] (SPARK-10407) Possible Stack-overflow using InheritableThreadLocal nested-properties for SparkContext.localProperties
Matt Cheah created SPARK-10407: -- Summary: Possible Stack-overflow using InheritableThreadLocal nested-properties for SparkContext.localProperties Key: SPARK-10407 URL: https://issues.apache.org/jira/browse/SPARK-10407 Project: Spark Issue Type: Bug Reporter: Matt Cheah Priority: Minor In my long-running web server that eventually uses a SparkContext, I eventually came across some stack overflow errors that could only be cleared by restarting my server. {code} java.lang.StackOverflowError: null at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2307) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718) ~[na:1.7.0_45] at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) ~[na:1.7.0_45] at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500) ~[na:1.7.0_45] ... ... at org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:96) ~[commons-lang3-3.3.jar:3.3] at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:516) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:529) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1770) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1803) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) ~[spark-core_2.10-1.4.1-palantir1.jar:1.4.1-palantir1] ... {code} The bottom of the trace indicates that serializing a properties object is part of the stack when the overflow happens. 
I checked the origin of the properties, and it turns out it's coming from SparkContext.localProperties, an InheritableThreadLocal field. When I debugged further, I found that localProperties.childValue() wraps its parent properties object in another properties object, and returns the wrapper properties. The problem is that every time childValue was being called, I was seeing the properties that were passed in from the parent had a deeper and deeper nesting of wrapped properties. This doesn't make any sense since my application doesn't create threads recursively or anything like that, so I'm marking this issue as a minor one since it shouldn't affect the average application. On the other hand, there shouldn't really be any reason to be creating the properties in childValue using nesting. Instead, the properties returned by childValue should be flattened, and more importantly, a deep copy of the parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
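The nesting pattern described in the report, and the flattened deep-copy alternative, can be illustrated with a small pure-Python sketch. WrappedProps is a hypothetical stand-in for java.util.Properties with a defaults chain (lookups fall through to the parent); it is not the actual Spark or JDK code:

```python
class WrappedProps:
    """Hypothetical stand-in for java.util.Properties(defaults):
    lookups fall through to the parent when a key is missing locally."""
    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}

    def get(self, key, default=None):
        if key in self.local:
            return self.local[key]
        return self.parent.get(key, default) if self.parent else default

    def depth(self):
        # Length of the wrapper chain; serializing this chain recursively
        # is what can blow the stack when depth grows without bound.
        return 1 + (self.parent.depth() if self.parent else 0)

def child_value_nested(parent):
    # The problematic pattern: each inheritance adds another wrapper layer.
    return WrappedProps(parent)

def child_value_flat(parent):
    # The proposed fix: flatten into an independent copy of everything
    # visible through the chain, so depth stays constant and later parent
    # mutations can't race with the child.
    flat = WrappedProps()
    node, seen = parent, {}
    while node is not None:
        for k, v in node.local.items():
            seen.setdefault(k, v)   # nearest definition wins, like the chain
        node = node.parent
    flat.local.update(seen)
    return flat
```

With the nested version, depth grows by one per inherited thread context; with the flat version it stays at one, which is the behavior the report argues for.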
[jira] [Created] (SPARK-10406) Document spark on yarn distributed cache symlink functionality
Thomas Graves created SPARK-10406: - Summary: Document spark on yarn distributed cache symlink functionality Key: SPARK-10406 URL: https://issues.apache.org/jira/browse/SPARK-10406 Project: Spark Issue Type: Bug Components: Documentation, YARN Affects Versions: 1.5.0 Reporter: Thomas Graves Spark on Yarn supports using the distributed cache via --files, --jars, --archives. It also supports specifying a name for those via #. ie foo.tgz#myname. myname is what foo.tgz is unarchived as and shows up in the local directory of the application . Similarly for files and jars. We should document this support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726296#comment-14726296 ] Sean Owen commented on SPARK-10405: --- This is fairly easy already with foldByKey and a priority queue -- does it really need its own API method? > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ashish shenoy updated SPARK-10405: -- Description: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" values from a given RDD. It'd be good to have an API that returned the "top" values per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. was: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" items from a given RDD. It'd be good to have an API that returned the "top" values per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > values from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ashish shenoy updated SPARK-10405: -- Description: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" items from a given RDD. It'd be good to have an API that returned the "top" values per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. was: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" values from a given RDD. It'd be good to have an API that returned the "top" values per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10405) Support takeOrdered and topK values per key
[ https://issues.apache.org/jira/browse/SPARK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ashish shenoy updated SPARK-10405: -- Description: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" items from a given RDD. It'd be good to have an API that returned the "top" values per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. was: Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" items from a given RDD. It'd be good to have an API that returned the "top" items per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. > Support takeOrdered and topK values per key > --- > > Key: SPARK-10405 > URL: https://issues.apache.org/jira/browse/SPARK-10405 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: ashish shenoy > Labels: features, newbie > > Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" > items from a given RDD. > It'd be good to have an API that returned the "top" values per key for a > keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the > task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10405) Support takeOrdered and topK values per key
ashish shenoy created SPARK-10405: - Summary: Support takeOrdered and topK values per key Key: SPARK-10405 URL: https://issues.apache.org/jira/browse/SPARK-10405 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: ashish shenoy Spark provides the top() and takeOrdered() APIs that return "top" or "bottom" items from a given RDD. It'd be good to have an API that returned the "top" items per key for a keyed RDD i.e. RDDpair. Such an API would be very useful for cases where the task is to only display an ordered subset of the input data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10392: --- Fix Version/s: 1.5.1 > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > Fix For: 1.6.0, 1.5.1 > > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 
_verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10392. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8556 [https://github.com/apache/spark/pull/8556] > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > Fix For: 1.6.0 > > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > 
len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.
[ https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10162. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8555 [https://github.com/apache/spark/pull/8555] > PySpark filters with datetimes mess up when datetimes have timezones. > - > > Key: SPARK-10162 > URL: https://issues.apache.org/jira/browse/SPARK-10162 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Kevin Cox > Fix For: 1.6.0 > > > PySpark appears to ignore timezone information when filtering on (and working > in general with) datetimes. > Please see the example below. The generated filter in the query plan is 5 > hours off (my computer is EST). > {code} > In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", > TimestampType())])) > In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain() > Filter (dt#9 > 9467028) > Scan PhysicalRDD[dt#9] > {code} > Note that 9467028 == Sat 1 Jan 2000 05:00:00 UTC -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9516: - Target Version/s: 1.6.0 > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9516: - Assignee: Nan Zhu > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu >Assignee: Nan Zhu > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel
[ https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9769: --- Assignee: Apache Spark > Add Python API for ml.feature.CountVectorizerModel > -- > > Key: SPARK-9769 > URL: https://issues.apache.org/jira/browse/SPARK-9769 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Add Python API, user guide and example for ml.feature.CountVectorizerModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel
[ https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9769: --- Assignee: (was: Apache Spark) > Add Python API for ml.feature.CountVectorizerModel > -- > > Key: SPARK-9769 > URL: https://issues.apache.org/jira/browse/SPARK-9769 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add Python API, user guide and example for ml.feature.CountVectorizerModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9769) Add Python API for ml.feature.CountVectorizerModel
[ https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726265#comment-14726265 ] Apache Spark commented on SPARK-9769: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8561 > Add Python API for ml.feature.CountVectorizerModel > -- > > Key: SPARK-9769 > URL: https://issues.apache.org/jira/browse/SPARK-9769 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add Python API, user guide and example for ml.feature.CountVectorizerModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10404) Worker should terminate previous executor before launch new one
Davies Liu created SPARK-10404: -- Summary: Worker should terminate previous executor before launch new one Key: SPARK-10404 URL: https://issues.apache.org/jira/browse/SPARK-10404 Project: Spark Issue Type: Bug Reporter: Davies Liu Reported here: http://apache-spark-user-list.1001560.n3.nabble.com/Hung-spark-executors-don-t-count-toward-worker-memory-limit-td16083.html#a24548 If a newly launched executor overlaps with previous ones, they could run the machine out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4223) Support * (meaning all users) as part of the acls
[ https://issues.apache.org/jira/browse/SPARK-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4223. Resolution: Fixed Fix Version/s: 1.6.0 > Support * (meaning all users) as part of the acls > - > > Key: SPARK-4223 > URL: https://issues.apache.org/jira/browse/SPARK-4223 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Zhuo Liu > Fix For: 1.6.0 > > > Currently we support setting view and modify acls but you have to specify a > list of users. It would be nice to support * meaning all users have access. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance
[ https://issues.apache.org/jira/browse/SPARK-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5269: - Target Version/s: 1.6.0 > BlockManager.dataDeserialize always creates a new serializer instance > - > > Key: SPARK-5269 > URL: https://issues.apache.org/jira/browse/SPARK-5269 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ivan Vergiliev >Assignee: Matt Cheah > Labels: performance, serializers > > BlockManager.dataDeserialize always creates a new instance of the serializer, > which is pretty slow in some cases. I'm using Kryo serialization and have a > custom registrator, and its register method is showing up as taking about 15% > of the execution time in my profiles. This started happening after I > increased the number of keys in a job with a shuffle phase by a factor of 40. > One solution I can think of is to create a ThreadLocal SerializerInstance for > the defaultSerializer, and only create a new one if a custom serializer is > passed in. AFAICT a custom serializer is passed only from > DiskStore.getValues, and that, on the other hand, depends on the serializer > passed to ExternalSorter. I don't know how often this is used, but I think > this can still be a good solution for the standard use case. > Oh, and also - ExternalSorter already has a SerializerInstance, so if the > getValues method is called from a single thread, maybe we can pass that > directly? > I'd be happy to try a patch but would probably need a confirmation from > someone that this approach would indeed work (or an idea for another). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
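[Editorial note] The ThreadLocal SerializerInstance idea proposed in the ticket can be sketched as follows. This is an illustrative Python sketch with a hypothetical stand-in serializer class (not Spark's actual Kryo wiring): each thread lazily constructs and caches one instance, so the expensive construction/registration work runs once per thread instead of once per deserialization call.

```python
import threading

class KryoLikeSerializer:
    """Hypothetical stand-in for an expensive-to-construct serializer."""
    instances_created = 0

    def __init__(self):
        # In the real scenario this is where custom-registrator work
        # (the ~15% hotspot mentioned in the ticket) would happen.
        KryoLikeSerializer.instances_created += 1

    def serialize(self, obj):
        return repr(obj).encode()

_local = threading.local()

def get_serializer():
    # Create at most one serializer per thread; subsequent calls on
    # the same thread reuse the cached instance.
    if not hasattr(_local, "serializer"):
        _local.serializer = KryoLikeSerializer()
    return _local.serializer

a, b = get_serializer(), get_serializer()
print(a is b)  # True
```

A custom serializer passed in explicitly (as from DiskStore.getValues) would bypass this cache, matching the ticket's suggestion to thread-local only the default serializer.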
[jira] [Updated] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10081: -- Target Version/s: 1.6.0 > Skip re-computing getMissingParentStages in DAGScheduler > > > Key: SPARK-10081 > URL: https://issues.apache.org/jira/browse/SPARK-10081 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > > In DAGScheduler, we can skip re-computing getMissingParentStages when calling > submitStage in handleJobSubmitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10081: -- Issue Type: Improvement (was: Bug) > Skip re-computing getMissingParentStages in DAGScheduler > > > Key: SPARK-10081 > URL: https://issues.apache.org/jira/browse/SPARK-10081 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > > In DAGScheduler, we can skip re-computing getMissingParentStages when calling > submitStage in handleJobSubmitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"
[ https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10247: -- Component/s: (was: Spark Core) Tests Scheduler > Cleanup DAGSchedulerSuite "ignore late map task completion" > --- > > Key: SPARK-10247 > URL: https://issues.apache.org/jira/browse/SPARK-10247 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit > confusing, we can add a few asserts & comments to clarify a little -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"
[ https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10247: -- Target Version/s: 1.6.0 > Cleanup DAGSchedulerSuite "ignore late map task completion" > --- > > Key: SPARK-10247 > URL: https://issues.apache.org/jira/browse/SPARK-10247 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Trivial > > the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit > confusing, we can add a few asserts & comments to clarify a little -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10247) Cleanup DAGSchedulerSuite "ignore late map task completion"
[ https://issues.apache.org/jira/browse/SPARK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10247: -- Priority: Trivial (was: Minor) > Cleanup DAGSchedulerSuite "ignore late map task completion" > --- > > Key: SPARK-10247 > URL: https://issues.apache.org/jira/browse/SPARK-10247 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Trivial > > the "ignore late map task completion" test in {{DAGSchedulerSuite}} is a bit > confusing, we can add a few asserts & comments to clarify a little -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726076#comment-14726076 ] Cody Koeninger commented on SPARK-10320: You would supply a function, similar to the way createDirectStream currently takes a messageHandler: MessageAndMetadata[K, V] => R The type of that function would be (Time, Map[TopicAndPartition, Long], Map[TopicAndPartition, LeaderOffset]) => (Map[TopicAndPartition, Long], Map[TopicAndPartition, LeaderOffset]) in other words (time, fromOffsets, untilOffsets) => (fromOffsets, untilOffsets) Your function would be called in the compute() method of the dstream, after contacting the leaders and before making the rdd for the next batch. That would let you make arbitrary modifications to the topics / partitions / offsets. As far as the desire for a general solution, I think this is a kafka-specific concern. Not all streams have topics. > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) > from where we left off with a minor delay. 
However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
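[Editorial note] The shape of the hook Cody Koeninger describes can be illustrated outside Spark. The sketch below is hypothetical (plain Python; `(topic, partition)` tuples stand in for TopicAndPartition, plain ints for LeaderOffset): a function of the proposed (time, fromOffsets, untilOffsets) => (fromOffsets, untilOffsets) form that drops a topic mid-stream and caps how many records each partition contributes to the next batch.

```python
def adjust_offsets(batch_time, from_offsets, until_offsets):
    """Hypothetical per-batch hook: receives the offset ranges the
    stream is about to consume and returns modified ranges."""
    max_per_batch = 1000
    # Unsubscribe from a topic without restarting the streaming context.
    keep = {tp: o for tp, o in from_offsets.items()
            if tp[0] != "retired_topic"}
    # Rate-limit: never pull more than max_per_batch records/partition.
    capped = {tp: min(until_offsets[tp], keep[tp] + max_per_batch)
              for tp in keep}
    return keep, capped

from_o = {("events", 0): 100, ("retired_topic", 0): 50}
until_o = {("events", 0): 5000, ("retired_topic", 0): 80}
print(adjust_offsets(None, from_o, until_o))
# ({('events', 0): 100}, {('events', 0): 1100})
```

Because the hook runs in compute() every batch interval, subscribing to a new topic is just the symmetric operation: add its partitions and starting offsets to the returned maps.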
[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725955#comment-14725955 ] Marcelo Vanzin commented on SPARK-10288: So can that instead be used as the reasoning in the design document? It talks about standalone and mesos having rest servers as if that by itself is a reason to have support for rest. The PR also talks about how now "YARN also supports this function", but since the backends are completely different in all cases, it makes no sense to mention standalone or mesos here. > Add a rest client for Spark on Yarn > --- > > Key: SPARK-10288 > URL: https://issues.apache.org/jira/browse/SPARK-10288 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > This is a proposal to add rest client for Spark on Yarn. Currently Spark > standalone and Mesos mode can support rest way of submitting applications, > for Spark on Yarn, it still uses program way to do it. Since RM now (from > 2.6) supports rest way of submitting application, so it would be better Spark > on Yarn also support this way. > Here is the design doc > (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). > Currently I'm working on it, working branch is > (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major > part is already finished. > Any comment is greatly appreciated, thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725951#comment-14725951 ] Sudarshan Kadambi commented on SPARK-10320: --- "it's almost certainly not the same thread". Yes, you're right. The new topic additions would happen in a different thread than the one that initialized the spark context and started the streaming context. Could you describe how the map of topic-partition and consumption offsets would be supplied? As an additional argument to createDirectStream() (callable even after the streaming context is started?) Perhaps a more complete sketch of the possible solution (even from just an end user API perspective) would help. Also, while we're looking to solve this problem in the context of Kafka, it'd be better to generalize the solution over all sorts of channels over which data can stream. > Kafka Support new topic subscriptions without requiring restart of the > streaming context > > > Key: SPARK-10320 > URL: https://issues.apache.org/jira/browse/SPARK-10320 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Sudarshan Kadambi > > Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe > to current ones once the streaming context has been started. Restarting the > streaming context increases the latency of update handling. > Consider a streaming application subscribed to n topics. Let's say 1 of the > topics is no longer needed in streaming analytics and hence should be > dropped. We could do this by stopping the streaming context, removing that > topic from the topic list and restarting the streaming context. Since with > some DStreams such as DirectKafkaStream, the per-partition offsets are > maintained by Spark, we should be able to resume uninterrupted (I think?) > from where we left off with a minor delay. 
However, in instances where > expensive state initialization (from an external datastore) may be needed for > datasets published to all topics, before streaming updates can be applied to > it, it is more convenient to only subscribe or unsubscribe to the incremental > changes to the topic list. Without such a feature, updates go unprocessed for > longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725935#comment-14725935 ] Meihua Wu commented on SPARK-8518: -- For the reference implementation, I recommend we consider this R function: https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html > Log-linear models for survival analysis > --- > > Key: SPARK-8518 > URL: https://issues.apache.org/jira/browse/SPARK-8518 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > We want to add basic log-linear models for survival analysis. The > implementation should match the result from R's survival package > (http://cran.r-project.org/web/packages/survival/index.html). > Design doc from [~yanboliang]: > https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page
[ https://issues.apache.org/jira/browse/SPARK-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10379: --- Target Version/s: 1.6.0, 1.5.1 (was: 1.5.0) > UnsafeShuffleExternalSorter should preserve first page > -- > > Key: SPARK-10379 > URL: https://issues.apache.org/jira/browse/SPARK-10379 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > {code} > 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID > 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable > to acquire 67108864 bytes of memory > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
Davies Liu created SPARK-10403: -- Summary: UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort) Key: SPARK-10403 URL: https://issues.apache.org/jira/browse/SPARK-10403 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu UnsafeRowSerializer relies on an EOF in the stream, but UnsafeRowWriter does not write an EOF between partitions. {code} java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174) at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
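A minimal, Spark-free sketch of the failure mode (all names here are illustrative, not Spark's classes): a reader that detects the end of its data via EOFException works when each partition is a separate stream, but over-reads when partitions are concatenated with no end marker between them.

```java
import java.io.*;

public class EofFraming {
    // Writer that emits records with no explicit end marker, like a
    // shuffle writer that lays partitions back to back in one file.
    static byte[] writeRecords(int[] records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int r : records) {
            out.writeInt(r);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reader that relies on EOFException to find the end of its records.
    // This only works if the stream really ends here; if another
    // partition's bytes follow, it reads straight into them.
    static int readUntilEof(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = 0;
        try {
            while (true) {
                in.readInt();
                count++;
            }
        } catch (EOFException expected) {
            // End of stream reached.
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] one = writeRecords(new int[]{1, 2, 3});
        System.out.println(readUntilEof(one)); // 3

        // Two partitions concatenated: the EOF-based reader cannot tell
        // where the first partition ends, so it reads all 6 records.
        byte[] two = new byte[one.length * 2];
        System.arraycopy(one, 0, two, 0, one.length);
        System.arraycopy(one, 0, two, one.length, one.length);
        System.out.println(readUntilEof(two)); // 6
    }
}
```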
[jira] [Updated] (SPARK-10394) Make GBTParams use shared "stepSize"
[ https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10394: -- Assignee: Yanbo Liang Affects Version/s: 1.5.0 Target Version/s: 1.6.0 > Make GBTParams use shared "stepSize" > > > Key: SPARK-10394 > URL: https://issues.apache.org/jira/browse/SPARK-10394 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > GBTParams currently defines "stepSize" as its learning rate. > ML already has the shared param class "HasStepSize"; GBTParams can extend it rather > than keeping a duplicated implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
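A toy illustration of the shared-param pattern the issue refers to (names modeled loosely on Spark ML's shared params, but all types here are hypothetical): the parameter is defined once in a mix-in, and any params class that needs a learning rate reuses it instead of redefining its own "stepSize".

```java
import java.util.HashMap;
import java.util.Map;

interface Params {
    Map<String, Object> paramMap();
}

// Shared parameter mix-in: defined once, reused everywhere.
interface HasStepSize extends Params {
    default void setStepSize(double value) {
        paramMap().put("stepSize", value);
    }
    default double getStepSize() {
        // Illustrative default only.
        return (double) paramMap().getOrDefault("stepSize", 0.1);
    }
}

// GBT-style params now inherit the shared definition instead of
// duplicating the getter/setter and default.
class GBTParams implements HasStepSize {
    private final Map<String, Object> params = new HashMap<>();
    public Map<String, Object> paramMap() { return params; }
}

public class SharedParamDemo {
    public static void main(String[] args) {
        GBTParams p = new GBTParams();
        System.out.println(p.getStepSize()); // default value
        p.setStepSize(0.05);
        System.out.println(p.getStepSize()); // explicitly set value
    }
}
```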
[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn
[ https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725913#comment-14725913 ] Steve Loughran commented on SPARK-10288: Long-haul job submission. You can't currently submit work to a running cluster if the RPC channel isn't open to you, which in cloud environments means "ssh tunnel fun" or "somehow get into the cluster" > Add a rest client for Spark on Yarn > --- > > Key: SPARK-10288 > URL: https://issues.apache.org/jira/browse/SPARK-10288 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Saisai Shao > > This is a proposal to add a REST client for Spark on YARN. Currently, the Spark > standalone and Mesos modes support submitting applications over REST, but > Spark on YARN still uses a programmatic way. Since the RM (from 2.6 on) > supports submitting applications over REST, it would be better if Spark > on YARN supported this as well. > Here is the design doc > (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing). > I'm currently working on it; the working branch is > (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), and the major > part is already finished. > Any comments are greatly appreciated, thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10324: -- Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add the `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. 
For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) ** logistic regression (SPARK-7685) ** linear regression (SPARK-9642) ** random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-2352) ** autoencoder (SPARK-4288) ** restricted Boltzmann machine (RBM) (SPARK-4251) ** convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) ** feature interaction (SPARK-9698) ** SQL transformer (SPARK-8345) ** ?? * test Kaggle datasets (SPARK-9941) h2. 
Model persistence * PMML export ** naive Bayes (SPARK-8546) ** decision tree (SPARK-8542) * model save/load ** FPGrowth (SPARK-6724) ** PrefixSpan (SPARK-10386) * code generation ** decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a complete list [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall into two major categories: * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751) * automatically test example code
[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work
[ https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725911#comment-14725911 ] Sean Owen commented on SPARK-10375: --- I don't think this is a problem in the sense that you would not be setting spark.driver props in your program anyway, kind of by definition. "Fixing" it just to emit a warning entails tracking the source of properties, whether it was set in one place, overridden elsewhere, then maintaining some blacklist of properties, etc. > Setting the driver memory with SparkConf().set("spark.driver.memory","1g") > does not work > > > Key: SPARK-10375 > URL: https://issues.apache.org/jira/browse/SPARK-10375 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.0 > Environment: Running with yarn >Reporter: Thomas >Priority: Minor > > When running pyspark 1.3.0 with yarn, the following code has no effect: > pyspark.SparkConf().set("spark.driver.memory","1g") > The Environment tab in yarn shows that the driver has 1g, however, the > Executors tab only shows 512 M (the default value) for the driver memory. > This issue goes away when the driver memory is specified via the command line > (i.e. --driver-memory 1g) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
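The underlying reason, consistent with the comments above: spark.driver.memory has to be known before the driver JVM is launched, so setting it in SparkConf from inside the already-running program is too late. A sketch of the working alternatives (the application file name is a placeholder):

```shell
# Pass the driver memory on the command line, before the JVM starts:
spark-submit --driver-memory 1g my_app.py

# ...or set it once for all applications in conf/spark-defaults.conf:
#   spark.driver.memory   1g
```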
[jira] [Commented] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work
[ https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725909#comment-14725909 ] Alex Rovner commented on SPARK-10375: - [~srowen] Shall we re-open? > Setting the driver memory with SparkConf().set("spark.driver.memory","1g") > does not work > > > Key: SPARK-10375 > URL: https://issues.apache.org/jira/browse/SPARK-10375 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.0 > Environment: Running with yarn >Reporter: Thomas >Priority: Minor > > When running pyspark 1.3.0 with yarn, the following code has no effect: > pyspark.SparkConf().set("spark.driver.memory","1g") > The Environment tab in yarn shows that the driver has 1g, however, the > Executors tab only shows 512 M (the default value) for the driver memory. > This issue goes away when the driver memory is specified via the command line > (i.e. --driver-memory 1g) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9043: - Target Version/s: 1.6.0 > Serialize key, value and combiner classes in ShuffleDependency > -- > > Key: SPARK-9043 > URL: https://issues.apache.org/jira/browse/SPARK-9043 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Matt Massie > > ShuffleManager implementations are currently not given type information > regarding the key, value and combiner classes. Serialization of shuffle > objects relies on them being JavaSerializable, with methods defined for > reading/writing the object or, alternatively, serialization via Kryo which > uses reflection. > Serialization systems like Avro, Thrift and Protobuf generate classes with > zero argument constructors and explicit schema information (e.g. > IndexedRecords in Avro have get, put and getSchema methods). > By serializing the key, value and combiner class names in ShuffleDependency, > shuffle implementations will have access to schema information when > registerShuffle() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
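A small illustration of the pattern the description refers to, with hypothetical names: given only a class name (the string a ShuffleDependency could carry), a zero-argument constructor lets a shuffle implementation instantiate the record reflectively and ask it for its schema, in the style of Avro's IndexedRecord.

```java
public class SchemaFromClassName {
    // Stand-in for an Avro-style generated record: a public zero-arg
    // constructor plus explicit schema information.
    public static class PairRecord {
        public PairRecord() {}
        public String getSchema() {
            return "{\"fields\":[\"key\",\"value\"]}";
        }
    }

    public static void main(String[] args) throws Exception {
        // The shuffle dependency would carry this name as a string;
        // the shuffle implementation recovers the schema from it.
        String valueClassName = "SchemaFromClassName$PairRecord";
        Object record = Class.forName(valueClassName)
                             .getDeclaredConstructor()
                             .newInstance();
        String schema = ((PairRecord) record).getSchema();
        System.out.println(schema);
    }
}
```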
[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9043: - Assignee: Matt Massie > Serialize key, value and combiner classes in ShuffleDependency > -- > > Key: SPARK-9043 > URL: https://issues.apache.org/jira/browse/SPARK-9043 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Matt Massie >Assignee: Matt Massie > > ShuffleManager implementations are currently not given type information > regarding the key, value and combiner classes. Serialization of shuffle > objects relies on them being JavaSerializable, with methods defined for > reading/writing the object or, alternatively, serialization via Kryo which > uses reflection. > Serialization systems like Avro, Thrift and Protobuf generate classes with > zero argument constructors and explicit schema information (e.g. > IndexedRecords in Avro have get, put and getSchema methods). > By serializing the key, value and combiner class names in ShuffleDependency, > shuffle implementations will have access to schema information when > registerShuffle() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10398) Migrate Spark download page to use new lua mirroring scripts
[ https://issues.apache.org/jira/browse/SPARK-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10398. --- Resolution: Fixed Fix Version/s: (was: 1.5.0) 1.5.1 1.6.0 Issue resolved by pull request 8557 [https://github.com/apache/spark/pull/8557] > Migrate Spark download page to use new lua mirroring scripts > > > Key: SPARK-10398 > URL: https://issues.apache.org/jira/browse/SPARK-10398 > Project: Spark > Issue Type: Task > Components: Project Infra >Reporter: Luciano Resende >Assignee: Luciano Resende >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > Attachments: SPARK-10398 > > > From infra team : > If you refer to www.apache.org/dyn/closer.cgi, please refer to > www.apache.org/dyn/closer.lua instead from now on. > Any non-conforming CGI scripts are no longer enabled, and are all > rewritten to go to our new mirror system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9043) Serialize key, value and combiner classes in ShuffleDependency
[ https://issues.apache.org/jira/browse/SPARK-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9043: - Component/s: (was: Spark Core) Shuffle > Serialize key, value and combiner classes in ShuffleDependency > -- > > Key: SPARK-9043 > URL: https://issues.apache.org/jira/browse/SPARK-9043 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Matt Massie > > ShuffleManager implementations are currently not given type information > regarding the key, value and combiner classes. Serialization of shuffle > objects relies on them being JavaSerializable, with methods defined for > reading/writing the object or, alternatively, serialization via Kryo which > uses reflection. > Serialization systems like Avro, Thrift and Protobuf generate classes with > zero argument constructors and explicit schema information (e.g. > IndexedRecords in Avro have get, put and getSchema methods). > By serializing the key, value and combiner class names in ShuffleDependency, > shuffle implementations will have access to schema information when > registerShuffle() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies
[ https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-10370: - Component/s: (was: Spark Core) Scheduler > After a stages map outputs are registered, all running attempts should be > marked as zombies > --- > > Key: SPARK-10370 > URL: https://issues.apache.org/jira/browse/SPARK-10370 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0 >Reporter: Imran Rashid > > Follow up to SPARK-5259. During stage retry, its possible for a stage to > "complete" by registering all its map output and starting the downstream > stages, before the latest task set has completed. This will result in the > earlier task set continuing to submit tasks, that are both unnecessary and > increase the chance of hitting SPARK-8029. > Spark should mark all tasks sets for a stage as zombie as soon as its map > output is registered. Note that this involves coordination between the > various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at > least) which isn't easily testable with the current setup. > To be clear, this is *not* just referring to canceling running tasks (which > may be taken care of by SPARK-2666). This is to make sure that the taskset > is marked as a zombie, to prevent submitting *new* tasks from this task set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
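A toy model of the proposed behavior (all names are illustrative, not Spark's actual scheduler classes): once the stage's map outputs are registered, every attempt is marked as a zombie, and zombies stop submitting new tasks.

```java
import java.util.ArrayList;
import java.util.List;

public class ZombieTaskSets {
    static class TaskSetAttempt {
        boolean zombie = false;
        int launched = 0;
        void maybeLaunchTask() {
            if (!zombie) launched++;  // zombies submit no new tasks
        }
    }

    static class Stage {
        List<TaskSetAttempt> attempts = new ArrayList<>();
        // Called when all map outputs for the stage are registered.
        void onMapOutputsRegistered() {
            for (TaskSetAttempt a : attempts) a.zombie = true;
        }
    }

    public static void main(String[] args) {
        Stage stage = new Stage();
        TaskSetAttempt first = new TaskSetAttempt();
        TaskSetAttempt retry = new TaskSetAttempt();
        stage.attempts.add(first);
        stage.attempts.add(retry);

        first.maybeLaunchTask();            // runs normally
        stage.onMapOutputsRegistered();     // stage "completes"
        first.maybeLaunchTask();            // now a zombie: no new task
        System.out.println(first.launched); // 1
    }
}
```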
[jira] [Commented] (SPARK-10296) add preservesParitioning parameter to RDD.map
[ https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725878#comment-14725878 ] Esteban Donato commented on SPARK-10296: Any further thoughts on this issue? Do you think it deserves a pull request with the enhancement? > add preservesParitioning parameter to RDD.map > - > > Key: SPARK-10296 > URL: https://issues.apache.org/jira/browse/SPARK-10296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Esteban Donato >Priority: Minor > > It would be nice to add the Boolean parameter preservesPartitioning with a > default of false to the RDD.map method, just as in the RDD.mapPartitions method. > If you agree, I can submit a pull request with this enhancement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
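A toy model of the semantics being proposed (illustrative names, not Spark's API): a plain map must drop the partitioner, because the mapped function may change keys, while a preservesPartitioning=true overload lets the caller assert that it does not.

```java
import java.util.function.UnaryOperator;

public class PreservesPartitioning {
    static class ModelRDD {
        final String partitioner;  // null = no known partitioner
        ModelRDD(String partitioner) { this.partitioner = partitioner; }

        // Plain map: keys may change, so the partitioner is dropped.
        ModelRDD map(UnaryOperator<Integer> f) {
            return new ModelRDD(null);
        }

        // The proposed variant: the caller asserts keys are unchanged,
        // so the partitioner can safely be carried over.
        ModelRDD map(UnaryOperator<Integer> f, boolean preservesPartitioning) {
            return new ModelRDD(preservesPartitioning ? partitioner : null);
        }
    }

    public static void main(String[] args) {
        ModelRDD rdd = new ModelRDD("hash");
        System.out.println(rdd.map(x -> x + 1).partitioner);       // null
        System.out.println(rdd.map(x -> x + 1, true).partitioner); // hash
    }
}
```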
[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code
[ https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10372: -- Target Version/s: 1.6.0 > Add end-to-end tests for the scheduling code > > > Key: SPARK-10372 > URL: https://issues.apache.org/jira/browse/SPARK-10372 > Project: Spark > Issue Type: Sub-task > Components: Scheduler, Tests >Affects Versions: 1.5.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > > The current testing framework for the scheduler only tests individual classes > in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc. Of > course that is useful, but we are missing tests which cover the interaction > between these components. We also have larger tests which run entire spark > jobs, but that doesn't allow fine grained control of failures for verifying > spark's fault-tolerance. > Adding a framework for testing the scheduler as a whole will: > 1. Allow testing bugs which involve the interaction between multiple parts of > the scheduler, eg. SPARK-10370 > 2. Greater confidence in refactoring the scheduler as a whole. Given the > tight coordination between the components its hard to consider any > refactoring, since it would be unlikely to be covered by any tests. > 3. Make it easier to increase test coverage. Writing tests for the > {{DAGScheduler}} now requires intimate knowledge of exactly how the > components fit together -- a lot of work goes into mimicking the appropriate > behavior of the other components. Furthermore, it makes the tests harder to > understand for the un-initiated -- which parts are simulating some condition > of an external system (eg., losing an executor), and which parts are just > interaction with other parts of the scheduler (eg., task resubmission)? > These tests will allow to just work at the level of the interaction w/ the > executors -- tasks complete, tasks fail, executors are lost, etc. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
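A minimal sketch of the style of framework the issue asks for (purely illustrative, not Spark code): the test drives a fake scheduler only through executor-level events such as task completion and task failure, then asserts on the job outcome rather than on internal component state.

```java
import java.util.HashSet;
import java.util.Set;

public class SchedulerHarnessSketch {
    // Minimal fake scheduler: the test completes or fails tasks by id
    // and only observes whether the whole job eventually finishes.
    static class FakeScheduler {
        final Set<Integer> pending = new HashSet<>();
        FakeScheduler(int numTasks) {
            for (int i = 0; i < numTasks; i++) pending.add(i);
        }
        void taskCompleted(int taskId) { pending.remove(taskId); }
        void taskFailed(int taskId)    { pending.add(taskId); } // resubmitted
        boolean jobFinished()          { return pending.isEmpty(); }
    }

    public static void main(String[] args) {
        FakeScheduler s = new FakeScheduler(3);
        s.taskCompleted(0);
        s.taskFailed(1);       // simulated failure: task must rerun
        s.taskCompleted(1);
        s.taskCompleted(2);
        System.out.println(s.jobFinished()); // true
    }
}
```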
[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code
[ https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10372: -- Component/s: (was: Spark Core) Tests Scheduler > Add end-to-end tests for the scheduling code > > > Key: SPARK-10372 > URL: https://issues.apache.org/jira/browse/SPARK-10372 > Project: Spark > Issue Type: Sub-task > Components: Scheduler, Tests >Affects Versions: 1.5.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > > The current testing framework for the scheduler only tests individual classes > in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc. Of > course that is useful, but we are missing tests which cover the interaction > between these components. We also have larger tests which run entire spark > jobs, but that doesn't allow fine grained control of failures for verifying > spark's fault-tolerance. > Adding a framework for testing the scheduler as a whole will: > 1. Allow testing bugs which involve the interaction between multiple parts of > the scheduler, eg. SPARK-10370 > 2. Greater confidence in refactoring the scheduler as a whole. Given the > tight coordination between the components its hard to consider any > refactoring, since it would be unlikely to be covered by any tests. > 3. Make it easier to increase test coverage. Writing tests for the > {{DAGScheduler}} now requires intimate knowledge of exactly how the > components fit together -- a lot of work goes into mimicking the appropriate > behavior of the other components. Furthermore, it makes the tests harder to > understand for the un-initiated -- which parts are simulating some condition > of an external system (eg., losing an executor), and which parts are just > interaction with other parts of the scheduler (eg., task resubmission)? > These tests will allow to just work at the level of the interaction w/ the > executors -- tasks complete, tasks fail, executors are lost, etc. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10192) Test for fetch failure in a shared dependency for "skipped" stages
[ https://issues.apache.org/jira/browse/SPARK-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10192: -- Component/s: (was: Spark Core) Scheduler > Test for fetch failure in a shared dependency for "skipped" stages > -- > > Key: SPARK-10192 > URL: https://issues.apache.org/jira/browse/SPARK-10192 > Project: Spark > Issue Type: Sub-task > Components: Scheduler, Tests >Reporter: Imran Rashid >Assignee: Imran Rashid > > One confusing corner case of the DAGScheduler is when there is a shared > shuffle dependency, a job might "skip" the stage associated with that shuffle > dependency, since its already been created as part of a different stage. > This means if there is a fetch failure, the retry will technically happen as > part of a different {{Stage}} instance. > This already works, but is lacking tests, so I just plan on adding a simple > test case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduling code
[ https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10372: -- Fix Version/s: (was: 1.6.0) > Add end-to-end tests for the scheduling code > > > Key: SPARK-10372 > URL: https://issues.apache.org/jira/browse/SPARK-10372 > Project: Spark > Issue Type: Sub-task > Components: Scheduler, Tests >Affects Versions: 1.5.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > > The current testing framework for the scheduler only tests individual classes > in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc. Of > course that is useful, but we are missing tests which cover the interaction > between these components. We also have larger tests which run entire spark > jobs, but that doesn't allow fine grained control of failures for verifying > spark's fault-tolerance. > Adding a framework for testing the scheduler as a whole will: > 1. Allow testing bugs which involve the interaction between multiple parts of > the scheduler, eg. SPARK-10370 > 2. Greater confidence in refactoring the scheduler as a whole. Given the > tight coordination between the components its hard to consider any > refactoring, since it would be unlikely to be covered by any tests. > 3. Make it easier to increase test coverage. Writing tests for the > {{DAGScheduler}} now requires intimate knowledge of exactly how the > components fit together -- a lot of work goes into mimicking the appropriate > behavior of the other components. Furthermore, it makes the tests harder to > understand for the un-initiated -- which parts are simulating some condition > of an external system (eg., losing an executor), and which parts are just > interaction with other parts of the scheduler (eg., task resubmission)? > These tests will allow to just work at the level of the interaction w/ the > executors -- tasks complete, tasks fail, executors are lost, etc. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10372) Add end-to-end tests for the scheduler
[ https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10372: -- Summary: Add end-to-end tests for the scheduler (was: Tests for entire scheduler) > Add end-to-end tests for the scheduler > -- > > Key: SPARK-10372 > URL: https://issues.apache.org/jira/browse/SPARK-10372 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 1.6.0 > > > The current testing framework for the scheduler only tests individual classes > in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc. Of > course that is useful, but we are missing tests which cover the interaction > between these components. We also have larger tests which run entire spark > jobs, but that doesn't allow fine grained control of failures for verifying > spark's fault-tolerance. > Adding a framework for testing the scheduler as a whole will: > 1. Allow testing bugs which involve the interaction between multiple parts of > the scheduler, eg. SPARK-10370 > 2. Greater confidence in refactoring the scheduler as a whole. Given the > tight coordination between the components its hard to consider any > refactoring, since it would be unlikely to be covered by any tests. > 3. Make it easier to increase test coverage. Writing tests for the > {{DAGScheduler}} now requires intimate knowledge of exactly how the > components fit together -- a lot of work goes into mimicking the appropriate > behavior of the other components. Furthermore, it makes the tests harder to > understand for the un-initiated -- which parts are simulating some condition > of an external system (eg., losing an executor), and which parts are just > interaction with other parts of the scheduler (eg., task resubmission)? > These tests will allow to just work at the level of the interaction w/ the > executors -- tasks complete, tasks fail, executors are lost, etc. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10192) Test for fetch failure in a shared dependency for "skipped" stages
[ https://issues.apache.org/jira/browse/SPARK-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10192: -- Issue Type: Sub-task (was: Test) Parent: SPARK-8987 > Test for fetch failure in a shared dependency for "skipped" stages > -- > > Key: SPARK-10192 > URL: https://issues.apache.org/jira/browse/SPARK-10192 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Imran Rashid >Assignee: Imran Rashid > > One confusing corner case of the DAGScheduler is when there is a shared > shuffle dependency, a job might "skip" the stage associated with that shuffle > dependency, since its already been created as part of a different stage. > This means if there is a fetch failure, the retry will technically happen as > part of a different {{Stage}} instance. > This already works, but is lacking tests, so I just plan on adding a simple > test case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org