[jira] [Created] (SPARK-17146) Add RandomizedSearch to the CrossValidator API
Manoj Kumar created SPARK-17146: --- Summary: Add RandomizedSearch to the CrossValidator API Key: SPARK-17146 URL: https://issues.apache.org/jira/browse/SPARK-17146 Project: Spark Issue Type: Improvement Reporter: Manoj Kumar Hi, I would like to add randomized search support for the Cross-Validator API. It should be quite straightforward to add with the present abstractions. Here is the proposed API: (Names are up for debate) Proposed Classes: "ParamSamplerBuilder" or a "ParamRandomizedBuilder" that returns an Array of ParamMaps Proposed Methods: "addBounds" "addSampler" "setNumIter" Code example: {code} import scala.util.Random def sampler(): Double = { Math.pow(10.0, -5 + Random.nextDouble * (5 - (-5))) } val paramGrid = new ParamRandomizedBuilder() .addSampler(lr.regParam, sampler) .addBounds(lr.elasticNetParam, 0.0, 1.0) .setNumIter(10) .build() {code} Let me know your thoughts! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
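For concreteness, a minimal sketch of how the proposed builder could look on top of the existing Param/ParamMap abstractions follows. This is not an existing Spark class: the class and method names are the ones proposed above, and the restriction to Double params plus the uniform sampling in addBounds are assumptions read off the proposal.

{code}
import scala.collection.mutable
import scala.util.Random

import org.apache.spark.ml.param.{Param, ParamMap}

// Hypothetical sketch of the proposed builder (names taken from the proposal).
class ParamRandomizedBuilder {
  private val samplers = mutable.LinkedHashMap[Param[Double], () => Double]()
  private var numIter: Int = 10

  def addSampler(param: Param[Double], sampler: () => Double): this.type = {
    samplers(param) = sampler
    this
  }

  // Assumed semantics: draw uniformly between lower and upper.
  def addBounds(param: Param[Double], lower: Double, upper: Double): this.type =
    addSampler(param, () => lower + Random.nextDouble * (upper - lower))

  def setNumIter(n: Int): this.type = { numIter = n; this }

  // Draw numIter independent ParamMaps, one sampled value per registered param.
  def build(): Array[ParamMap] =
    Array.fill(numIter) {
      samplers.foldLeft(ParamMap.empty) { case (pm, (p, s)) => pm.put(p, s()) }
    }
}
{code}

The resulting Array[ParamMap] could then be fed to CrossValidator.setEstimatorParamMaps exactly like the output of the existing ParamGridBuilder.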
[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425522#comment-15425522 ] Manoj Kumar commented on SPARK-17116: - Haha, not really. I just found it odd that setParams accepts the parameter as a string, while params at fit time are instances of Param. > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17118) Make examples Python3 compatible
Manoj Kumar created SPARK-17118: --- Summary: Make examples Python3 compatible Key: SPARK-17118 URL: https://issues.apache.org/jira/browse/SPARK-17118 Project: Spark Issue Type: Improvement Reporter: Manoj Kumar There are various examples that do not work in Python 3. Most of the fixes just involve converting print statements into print function calls. (examples/src/main/python/ml/estimator_transformer_param_example.py) is one such example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-17116: Description: Currently, it is possible to override the default params set at constructor time by supplying a ParamMap which is essentially a (Param: value) dict. Looking at the codebase, it should be trivial to extend this to a (string, value) representation. {code} # This hints that the maxiter param of the lr instance is modified in-place lr = LogisticRegression(maxIter=10, regParam=0.01) lr.fit(dataset, {lr.maxIter: 20}) # This seems more natural. lr.fit(dataset, {"maxIter": 20}) {code} was: Currently, it is possible to override the default params set at constructor time by supplying a ParamMap which is essentially a (Param: value) dict. Looking at the codebase, it should be trivial to extend this to a (string, value) representation. {code} # This hints that the maxiter param of the lr instance is modified in-place lr = LogisticRegression(maxIter=10, regParam=0.01) lr.fit(dataset, {lr.maxIter: 20}) # This seems more natural. lr.fit(dataset, {"maxiter": 20}) {code} > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425494#comment-15425494 ] Manoj Kumar edited comment on SPARK-17116 at 8/17/16 10:17 PM: --- [~josephkb] [~mlnick] [~holdenk] This is not super important, but I do think it will be helpful. was (Author: mechcoder): [~josephkb] [~mlnick] This is not super important, but I do think it will be helpful. > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxiter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425494#comment-15425494 ] Manoj Kumar commented on SPARK-17116: - [~josephkb] [~mlnick] This is not super important, but I do think it will be helpful. > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxiter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-17116: Summary: Allow params to be a {string, value} dict at fit time (was: Allow params to be a {string, value} dict) > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxiter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17116) Allow params to be a {string, value} dict
Manoj Kumar created SPARK-17116: --- Summary: Allow params to be a {string, value} dict Key: SPARK-17116 URL: https://issues.apache.org/jira/browse/SPARK-17116 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Manoj Kumar Priority: Minor Currently, it is possible to override the default params set at constructor time by supplying a ParamMap which is essentially a (Param: value) dict. Looking at the codebase, it should be trivial to extend this to a (string, value) representation. {code} # This hints that the maxiter param of the lr instance is modified in-place lr = LogisticRegression(maxIter=10, regParam=0.01) lr.fit(dataset, {lr.maxIter: 20}) # This seems more natural. lr.fit(dataset, {"maxiter": 20}) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
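The string-to-Param resolution this proposal needs already exists on the Scala side via Params.getParam, so the translation could plausibly be a thin helper along the following lines (a sketch; the helper name toParamMap is hypothetical, while getParam, ParamMap.empty, and ParamMap.put are existing API):

{code}
import org.apache.spark.ml.param.{ParamMap, Params}

// Hypothetical helper: resolve each (name -> value) pair against the stage's
// declared params and collect the results into a ParamMap.
def toParamMap(stage: Params, named: Map[String, Any]): ParamMap =
  named.foldLeft(ParamMap.empty) { case (pm, (name, value)) =>
    pm.put(stage.getParam(name), value)
  }
{code}

fit could then accept the dict form and delegate to such a helper before reusing the existing ParamMap code path.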
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368602#comment-15368602 ] Manoj Kumar commented on SPARK-16365: - Could you be a bit clearer about the first point? Is it so that people can quickly prototype locally with a small subsample of the data before doing the dataframe | RDD conversion to handle huge amounts of data? > Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstormings and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368216#comment-15368216 ] Manoj Kumar commented on SPARK-3728: Hi [~xusen]. Are you still working on this? > RandomForest: Learn models too large to store in memory > --- > > Key: SPARK-3728 > URL: https://issues.apache.org/jira/browse/SPARK-3728 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Proposal: Write trees to disk as they are learned. > RandomForest currently uses a FIFO queue, which means training all trees at > once via breadth-first search. Using a FILO queue would encourage the code > to finish one tree before moving on to new ones. This would allow the code > to write trees to disk as they are learned. > Note: It would also be possible to write nodes to disk as they are learned > using a FIFO queue, once the example--node mapping is cached [JIRA]. The > [Sequoia Forest package]() does this. However, it could be useful to learn > trees progressively, so that future functionality such as early stopping > (training fewer trees than expected) could be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
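As a toy illustration of the FIFO-versus-FILO point from the issue (self-contained; not Spark's actual node-scheduling code): with a stack, every node of one tree is expanded before the next tree starts, which is exactly the property that would let finished trees be written to disk as training proceeds.

{code}
import scala.collection.mutable

// Depth-first (FILO) scheduling: tree 0 finishes completely before tree 1
// starts; a FIFO queue would instead interleave nodes of all trees.
case class NodeTask(tree: Int, depth: Int)

val work = mutable.Stack(NodeTask(tree = 1, depth = 0), NodeTask(tree = 0, depth = 0))
while (work.nonEmpty) {
  val task = work.pop()
  if (task.depth < 2) { // pretend every node shallower than depth 2 splits in two
    work.push(NodeTask(task.tree, task.depth + 1))
    work.push(NodeTask(task.tree, task.depth + 1))
  }
  println(s"trained tree ${task.tree} node at depth ${task.depth}")
}
// Every node of tree 0 is processed before any node of tree 1, so tree 0
// could be flushed to disk at that point.
{code}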
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366978#comment-15366978 ] Manoj Kumar commented on SPARK-16365: - Is the ultimate aim to make mllib-local the scikit-learn of Scala? > Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstormings and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
[ https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366392#comment-15366392 ] Manoj Kumar commented on SPARK-16399: - It would just run with the default Python, which in this case is Python 2.6 > Set PYSPARK_PYTHON to point to "python" instead of "python2.7" > -- > > Key: SPARK-16399 > URL: https://issues.apache.org/jira/browse/SPARK-16399 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Manoj Kumar >Assignee: Manoj Kumar >Priority: Minor > Fix For: 2.1.0 > > > Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even > though higher versions of Python seem to be installed. > It would be better to point "PYSPARK_PYTHON" to python instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
Manoj Kumar created SPARK-16399: --- Summary: Set PYSPARK_PYTHON to point to "python" instead of "python2.7" Key: SPARK-16399 URL: https://issues.apache.org/jira/browse/SPARK-16399 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Manoj Kumar Priority: Minor Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even though higher versions of Python seem to be installed. It would be better to point "PYSPARK_PYTHON" to python instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16307) Improve testing for DecisionTree variances
Manoj Kumar created SPARK-16307: --- Summary: Improve testing for DecisionTree variances Key: SPARK-16307 URL: https://issues.apache.org/jira/browse/SPARK-16307 Project: Spark Issue Type: Test Reporter: Manoj Kumar Priority: Minor The current test assumes that Impurity.calculate() returns the variance correctly. A better approach would be to test whether the returned variance equals the variance that we can manually verify on toy data and a toy tree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16306) Improve testing for DecisionTree variances
Manoj Kumar created SPARK-16306: --- Summary: Improve testing for DecisionTree variances Key: SPARK-16306 URL: https://issues.apache.org/jira/browse/SPARK-16306 Project: Spark Issue Type: Test Reporter: Manoj Kumar Priority: Minor The current test assumes that Impurity.calculate() returns the variance correctly. A better approach would be to test whether the returned variance equals the variance that we can manually verify on toy data and a toy tree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
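A sketch of the proposed check, with the reference variance computed by hand (the final assertion is left as a comment because the issue does not spell out the exact accessor for a trained leaf's impurity):

{code}
// Hand-computed population variance on toy labels; the test would train a
// small tree on these labels and compare the variance reported at the leaf.
val labels = Seq(1.0, 2.0, 3.0, 4.0)
val mean = labels.sum / labels.size
val expectedVariance = labels.map(y => (y - mean) * (y - mean)).sum / labels.size
// e.g. assert(math.abs(leafImpurity - expectedVariance) < 1e-9)
{code}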
[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335165#comment-15335165 ] Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM: --- OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% |1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% was (Author: mechcoder): OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% 1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335165#comment-15335165 ] Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM: --- OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% 1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% was (Author: mechcoder): OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent in binsToBestSplit || percent in impurityCalculator || percent in impurityStatsTime || |1 | 1 | 500 | 2 | 19.5% | 15% | 0.1% |10 | 1 | 500 | 2.45 | 13% | 8.5%| 0.7% |100 | 1 | 500 | 4.48 | 64.5% | 41.5% | 2.1% |500 | 1 | 500 | 15.2 | 89.6% | 61.1% | 3.4% |1 | 500 | 1 | 2.16 | 18.5% | 16.2% | ~ |10 | 500 | 1 | 2.70 | 14.8% | 11.1%| 0.4% |100 | 500 | 1 | 9.07 | 43.5% | 31.4% | 1.9% |1 | 1000 | 1000 | 2.02 | 24.7% | 14.8% | 0.2% |10 | 1000 | 1000 | 6.2 | 12.8% | 9.6%| 0.1% |50 | 1000 | 1000 | 4.05 | 38.5% | 28.8% | 2.8% |100 | 1000 | 1000 | 10.19 | 45.3% | 30.6% | 3.18% > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-14351: Comment: was deleted (was: Here are my thoughts: Also ccing [~sethah] since he has seen this part of the codebase quite a few times to get some ideas. Right now, a copy of size *statsSize* is created that is sliced from *allStats* for every instantiation of *ImpurityCalculator*. The reason is that there are calls to *add* and *subtract* that modify the *ImpurityCalculator* in place. However, the calls to *add* or *subtract* are far less frequent than the class instantiations (roughly one call for every two times *ImpurityCalculator* is instantiated). I see two alternatives. 1. Pass the view directly to the *ImpurityCalculator* and make a copy whenever *add* or *subtract* is called. 2. Pass *allStats*, *offset*, *offset + statsSize* to the *ImpurityCalculator* and make a copy of *allStats* whenever *add* or *subtract* is called. Both will involve making *stats* a def, which would provide a copy whenever it is called. The first one is more favourable because the size of *allStats* is huge. WDYT?) > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345182#comment-15345182 ] Manoj Kumar commented on SPARK-14351: - Here are my thoughts: Also ccing [~sethah] since he has seen this part of the codebase quite a few times to get some ideas. Right now, a copy of size *statsSize* is created that is sliced from *allStats* for every instantiation of *ImpurityCalculator*. The reason is that there are calls to *add* and *subtract* that modify the *ImpurityCalculator* in place. However, the calls to *add* or *subtract* are far less frequent than the class instantiations (roughly one call for every two times *ImpurityCalculator* is instantiated). I see two alternatives. 1. Pass the view directly to the *ImpurityCalculator* and make a copy whenever *add* or *subtract* is called. 2. Pass *allStats*, *offset*, *offset + statsSize* to the *ImpurityCalculator* and make a copy of *allStats* whenever *add* or *subtract* is called. Both will involve making *stats* a def, which would provide a copy whenever it is called. The first one is more favourable because the size of *allStats* is huge. WDYT? > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
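A minimal sketch of what alternative 2 could look like, using illustrative names rather than the actual ImpurityCalculator code: the calculator indexes into the shared allStats buffer directly and only materializes its own copy on the first mutating call (copy-on-write).

{code}
// Illustrative copy-on-write view over a slice of allStats.
class ImpurityView(allStats: Array[Double], offset: Int, statsSize: Int) {
  private var ownedStats: Array[Double] = null // created lazily on first mutation

  private def backing: Array[Double] = if (ownedStats != null) ownedStats else allStats
  private def base: Int = if (ownedStats != null) 0 else offset

  // Read path: no copy, index straight into the shared buffer.
  def apply(i: Int): Double = backing(base + i)

  // Mutating path: copy the slice exactly once, then update in place.
  def add(other: ImpurityView): this.type = {
    if (ownedStats == null) {
      ownedStats = java.util.Arrays.copyOfRange(allStats, offset, offset + statsSize)
    }
    var i = 0
    while (i < statsSize) { ownedStats(i) += other(i); i += 1 }
    this
  }
}
{code}

Under this scheme the read-heavy getCalculator path allocates no arrays at all; the copying cost moves to the comparatively rare add/subtract calls.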
[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335165#comment-15335165 ] Manoj Kumar commented on SPARK-14351: - OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent in binsToBestSplit || percent in impurityCalculator || percent in impurityStatsTime || |1 | 1 | 500 | 2 | 19.5% | 15% | 0.1% |10 | 1 | 500 | 2.45 | 13% | 8.5%| 0.7% |100 | 1 | 500 | 4.48 | 64.5% | 41.5% | 2.1% |500 | 1 | 500 | 15.2 | 89.6% | 61.1% | 3.4% |1 | 500 | 1 | 2.16 | 18.5% | 16.2% | ~ |10 | 500 | 1 | 2.70 | 14.8% | 11.1%| 0.4% |100 | 500 | 1 | 9.07 | 43.5% | 31.4% | 1.9% |1 | 1000 | 1000 | 2.02 | 24.7% | 14.8% | 0.2% |10 | 1000 | 1000 | 6.2 | 12.8% | 9.6%| 0.1% |50 | 1000 | 1000 | 4.05 | 38.5% | 28.8% | 2.8% |100 | 1000 | 1000 | 10.19 | 45.3% | 30.6% | 3.18% > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328941#comment-15328941 ] Manoj Kumar commented on SPARK-14351: - I can try working on this. > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328939#comment-15328939 ] Manoj Kumar edited comment on SPARK-3155 at 6/14/16 5:01 AM: - 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e. stop training after the validation error increases). However, this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criterion depends on validation data), and it is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the tree to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. was (Author: mechcoder): 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e stop training after the validation error increases) . However this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criteria is dependent on validation data) And is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the train to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328939#comment-15328939 ] Manoj Kumar commented on SPARK-3155: 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e. stop training after the validation error increases). However, this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criterion depends on validation data), and it is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the tree to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328592#comment-15328592 ] Manoj Kumar commented on SPARK-3155: I would like to add support for pruning DecisionTrees as part of my internship. Some API-related questions: Support for DecisionTree pruning in R is done in this way: prune(fit, cp=) A very straightforward extension to start with would be: model.prune(validationData, errorTol=) where model is a fitted DecisionTreeRegressionModel; pruning would stop when the improvement in error is not above a certain tolerance. Does that sound like a good idea? > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-3155: --- Comment: was deleted (was: I would like to add support for pruning DecisionTrees as part of my internship. Some API-related questions: Support for DecisionTree pruning in R is done in this way: prune(fit, cp=) A very straightforward extension to start with would be: model.prune(validationData, errorTol=) where model is a fitted DecisionTreeRegressionModel; pruning would stop when the improvement in error is not above a certain tolerance. Does that sound like a good idea? ) > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328487#comment-15328487 ] Manoj Kumar commented on SPARK-3155: I would like to add support for pruning DecisionTrees as part of my internship. Some API-related questions: Support for DecisionTree pruning in R is done in this way: prune(fit, cp=) A very straightforward extension to start with would be: model.prune(validationData, errorTol=) where model is a fitted DecisionTreeRegressionModel; pruning would stop when the improvement in error is not above a certain tolerance. Does that sound like a good idea? > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
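To make the naive procedure from the issue concrete, here is a self-contained toy sketch of bottom-up post-pruning on a squared-error regression tree. It deliberately uses its own tiny Node types rather than Spark's internal tree classes:

{code}
// Toy tree: each node carries the training-set prediction from step (2).
sealed trait Node { def prediction: Double }
case class Leaf(prediction: Double) extends Node
case class Internal(prediction: Double, left: Node, right: Node,
                    goLeft: Double => Boolean) extends Node

def predict(node: Node, x: Double): Double = node match {
  case Leaf(p) => p
  case n: Internal => if (n.goLeft(x)) predict(n.left, x) else predict(n.right, x)
}

// Squared error of a subtree on (feature, label) validation pairs: step (3).
def error(node: Node, data: Seq[(Double, Double)]): Double =
  data.map { case (x, y) => math.pow(predict(node, x) - y, 2) }.sum

// Step (4), applied recursively: prune the children first, then collapse the
// node into a leaf if its own prediction does no worse on the validation data.
def prune(node: Node, validation: Seq[(Double, Double)]): Node = node match {
  case leaf: Leaf => leaf
  case n: Internal =>
    val (l, r) = validation.partition { case (x, _) => n.goLeft(x) }
    val pruned = n.copy(left = prune(n.left, l), right = prune(n.right, r))
    if (error(Leaf(n.prediction), validation) <= error(pruned, validation)) Leaf(n.prediction)
    else pruned
}
{code}

An errorTol parameter, as in the proposed model.prune(validationData, errorTol=), would simply loosen the comparison in the final if.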
[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319450#comment-15319450 ] Manoj Kumar commented on SPARK-9623: [~yanboliang] Are you still working on this? Would you mind if I take over? > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
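For reference, the requested quantity is simple once the per-tree predictions for a row are available. A sketch, where treePredictions stands in for evaluating every tree of the trained forest on one example:

{code}
// Variance of the individual trees' predictions for a single example,
// analogous to class probabilities in classification.
def predictionVariance(treePredictions: Array[Double]): Double = {
  val mean = treePredictions.sum / treePredictions.length
  treePredictions.map(p => (p - mean) * (p - mean)).sum / treePredictions.length
}
{code}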
[jira] [Created] (SPARK-15761) pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3
Manoj Kumar created SPARK-15761: --- Summary: pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3 Key: SPARK-15761 URL: https://issues.apache.org/jira/browse/SPARK-15761 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Manoj Kumar Priority: Minor My default python is ipython3 and it is odd that it fails with "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14706722#comment-14706722 ] Manoj Kumar commented on SPARK-6192: [~rxin] It wraps up in a few hours. I have written a blog post summarizing the work done this summer. https://manojbits.wordpress.com/2015/08/21/google-summer-of-code-wrapup/ I think this can be marked as resolved :) cc: [~josephkb] Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9848) Add @Since annotation to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703133#comment-14703133 ] Manoj Kumar commented on SPARK-9848: Do we want to tag spark.ml in this release as well? Add @Since annotation to new public APIs in 1.5 --- Key: SPARK-9848 URL: https://issues.apache.org/jira/browse/SPARK-9848 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Labels: starter We should get a list of new APIs from SPARK-9660. cc: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
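For reference, the tagging under discussion looks like the following on an illustrative member (@Since is the existing annotation in org.apache.spark.annotation; the class and method here are made up):

{code}
import org.apache.spark.annotation.Since

class ExampleModel {
  /** Illustrative only: each new public member is tagged with the release that introduced it. */
  @Since("1.5.0")
  def newPublicMethod(x: Int): Int = x * 2
}
{code}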
[jira] [Commented] (SPARK-9848) Add @since tag to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702808#comment-14702808 ] Manoj Kumar commented on SPARK-9848: Well, actually, in the linked JIRA I could find no new additions. There is only a single change that marks an aux constructor of tree as private. Can we mark this as resolved? Add @since tag to new public APIs in 1.5 Key: SPARK-9848 URL: https://issues.apache.org/jira/browse/SPARK-9848 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Labels: starter We should get a list of new APIs from SPARK-9660. cc: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702816#comment-14702816 ] Manoj Kumar commented on SPARK-7751: Did you forget to add mllib.feature? Add @since to stable and experimental methods in MLlib -- Key: SPARK-7751 URL: https://issues.apache.org/jira/browse/SPARK-7751 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Labels: starter This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tag for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib: (Do NOT tag private or package private classes or methods.) * an example PR for Scala: https://github.com/apache/spark/pull/6101 * an example PR for Python: https://github.com/apache/spark/pull/6295 We need to dig the history of git commit to figure out what was the Spark version when a method was first introduced. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags. {code} meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType def setModelType(modelType: String): NaiveBayes = { {code} If there are better ways, please let us know. We cannot add all @since tags in a single PR, which is hard to review. So we made some subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10108) Add @since tags to mllib.feature
Manoj Kumar created SPARK-10108: --- Summary: Add @since tags to mllib.feature Key: SPARK-10108 URL: https://issues.apache.org/jira/browse/SPARK-10108 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10082) Validate i, j in apply (Dense and Sparse Matrices)
[ https://issues.apache.org/jira/browse/SPARK-10082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-10082: Component/s: MLlib Validate i, j in apply (Dense and Sparse Matrices) -- Key: SPARK-10082 URL: https://issues.apache.org/jira/browse/SPARK-10082 Project: Spark Issue Type: Bug Components: MLlib Reporter: Manoj Kumar Priority: Minor The given row_ind should be less than the number of rows, and the given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior for such cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700802#comment-14700802 ] Manoj Kumar commented on SPARK-9911: Umm. What additional advantage does the MulticlassClassificationEvaluator (or the Evaluator abstract class) have on top of the evaluate methods planned to be added in all the transformer models? (eg in LogisticRegression and LinearRegression as of now) cc: [~josephkb] User guide for MulticlassClassificationEvaluator Key: SPARK-9911 URL: https://issues.apache.org/jira/browse/SPARK-9911 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines which is not present in MLlib. We need to update the user-guide ({{ml-guide#Algorithm Guides}} to document this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
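For context on the question above: the main advantage of MulticlassClassificationEvaluator is that it implements the generic Evaluator contract consumed by CrossValidator and TrainValidationSplit for model selection, which a model-specific evaluate method cannot plug into. A short usage sketch (the column names shown are the defaults):

{code}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Any Evaluator can be handed to CrossValidator via setEvaluator.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")
// val f1 = evaluator.evaluate(predictions) // predictions: DataFrame with label/prediction columns
{code}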
[jira] [Created] (SPARK-10082) Validate i, j in apply (Dense and Sparse Matrices)
Manoj Kumar created SPARK-10082: --- Summary: Validate i, j in apply (Dense and Sparse Matrices) Key: SPARK-10082 URL: https://issues.apache.org/jira/browse/SPARK-10082 Project: Spark Issue Type: Bug Reporter: Manoj Kumar Priority: Minor The given row_ind should be less than the number of rows, and the given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior for such cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701573#comment-14701573 ] Manoj Kumar commented on SPARK-9911: Ah I see. Thanks for the clarification. Where should the user guide for these evaluators go? I do not see a user guide for BinaryClassificationEvaluator either. Or should I just add an example in docs/ml-guide.md that tunes a problem with a MulticlassClassificationEvaluator? User guide for MulticlassClassificationEvaluator Key: SPARK-9911 URL: https://issues.apache.org/jira/browse/SPARK-9911 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm Guides}}) to document this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9911) User guide for MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-9911: --- Comment: was deleted (was: Ah I see. Thanks for the clarification. Where should the user guide for these evaluators go? I do not see any user-guide for BinaryClassificationEvaluator as well. Or should I just add an example in docs/ml-guide.md to tune a problem with a MulticlassClassificationEvaluator?) User guide for MulticlassClassificationEvaluator Key: SPARK-9911 URL: https://issues.apache.org/jira/browse/SPARK-9911 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm Guides}}) to document this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698841#comment-14698841 ] Manoj Kumar commented on SPARK-9911: Can I work on this? User guide for MulticlassClassificationEvaluator Key: SPARK-9911 URL: https://issues.apache.org/jira/browse/SPARK-9911 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm Guides}}) to document this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695156#comment-14695156 ] Manoj Kumar commented on SPARK-9906: Sure! User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-guide}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6364) hashCode and equals for Matrices
[ https://issues.apache.org/jira/browse/SPARK-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695468#comment-14695468 ] Manoj Kumar commented on SPARK-6364: It is all right, there is enough work for everybody :P hashCode and equals for Matrices Key: SPARK-6364 URL: https://issues.apache.org/jira/browse/SPARK-6364 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar hashCode implementation should be similar to Vector's. But we may want to reduce the complexity by scanning only a few nonzeros instead of all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9919) Matrices should respect Java's equals and hashCode contract
[ https://issues.apache.org/jira/browse/SPARK-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695491#comment-14695491 ] Manoj Kumar commented on SPARK-9919: OK, but I need to make some changes. I'll resubmit the PR shortly. Matrices should respect Java's equals and hashCode contract --- Key: SPARK-9919 URL: https://issues.apache.org/jira/browse/SPARK-9919 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Priority: Critical The contract for Java's Object is that a.equals(b) implies a.hashCode == b.hashCode. So usually we need to implement both. The problem with hashCode is that we shouldn't compute it based on all values, which could be very expensive. You can use the implementation of Vector.hashCode as a template, but that requires some changes to avoid hash code collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
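To make the contract concrete, here is a hedged, plain-Python sketch (not Spark's implementation): equality compares all fields, while the hash mixes the dimensions with only a bounded prefix of nonzero values, so equal matrices still hash equally but hashing stays cheap.
{code}
class MatrixLike(object):
    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols = num_rows, num_cols
        self.values = list(values)

    def __eq__(self, other):
        return (isinstance(other, MatrixLike)
                and self.num_rows == other.num_rows
                and self.num_cols == other.num_cols
                and self.values == other.values)

    def __hash__(self):
        # Equal objects have identical dims and values, so hashing the dims
        # plus the first few nonzeros preserves a == b => hash(a) == hash(b).
        h = hash((self.num_rows, self.num_cols))
        seen = 0
        for i, v in enumerate(self.values):
            if v != 0:
                h = hash((h, i, v))
                seen += 1
                if seen == 16:
                    break
        return h

a = MatrixLike(2, 2, [1.0, 0.0, 0.0, 2.0])
b = MatrixLike(2, 2, [1.0, 0.0, 0.0, 2.0])
assert a == b and hash(a) == hash(b)
{code}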
[jira] [Resolved] (SPARK-8633) List missing model methods in Python Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar resolved SPARK-8633. Resolution: Fixed List missing model methods in Python Pipeline API - Key: SPARK-8633 URL: https://issues.apache.org/jira/browse/SPARK-8633 Project: Spark Issue Type: Task Components: ML, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Most Python models under the pipeline API are implemented as JavaModel wrappers. However, we didn't provide methods to extract information from model. In SPARK-7647, we added weights and intercept to linear models. This JIRA is to list all missing model methods, create JIRAs for each, and link them here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8633) List missing model methods in Python Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660372#comment-14660372 ] Manoj Kumar commented on SPARK-8633: Should I mark this as resolved? List missing model methods in Python Pipeline API - Key: SPARK-8633 URL: https://issues.apache.org/jira/browse/SPARK-8633 Project: Spark Issue Type: Task Components: ML, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Most Python models under the pipeline API are implemented as JavaModel wrappers. However, we didn't provide methods to extract information from model. In SPARK-7647, we added weights and intercept to linear models. This JIRA is to list all missing model methods, create JIRAs for each, and link them here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658600#comment-14658600 ] Manoj Kumar commented on SPARK-6488: I'll create a JIRA in a while. I am just adding support for the multiply method in RowMatrix, since it goes with the PCA example (along with PCA and SVD). The others can be done by you :) Thanks! Support addition/multiplication in PySpark's BlockMatrix Key: SPARK-6488 URL: https://issues.apache.org/jira/browse/SPARK-6488 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We should reuse the Scala implementation instead of having a separate implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9655) Add missing methods to linalg.distributed
Manoj Kumar created SPARK-9655: -- Summary: Add missing methods to linalg.distributed Key: SPARK-9655 URL: https://issues.apache.org/jira/browse/SPARK-9655 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Manoj Kumar Missing methods in linalg.distributed:

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9655) Add missing methods to linalg.distributed
[ https://issues.apache.org/jira/browse/SPARK-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar closed SPARK-9655. -- Resolution: Duplicate Add missing methods to linalg.distributed - Key: SPARK-9655 URL: https://issues.apache.org/jira/browse/SPARK-9655 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Manoj Kumar Missing methods in linalg.distributed:

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9656) Add missing methods to linalg.distributed
[ https://issues.apache.org/jira/browse/SPARK-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658937#comment-14658937 ] Manoj Kumar commented on SPARK-9656: cc: [~mwdus...@us.ibm.com] Add missing methods to linalg.distributed - Key: SPARK-9656 URL: https://issues.apache.org/jira/browse/SPARK-9656 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Manoj Kumar Missing methods in linalg.distributed:

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9656) Add missing methods to linalg.distributed
Manoj Kumar created SPARK-9656: -- Summary: Add missing methods to linalg.distributed Key: SPARK-9656 URL: https://issues.apache.org/jira/browse/SPARK-9656 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Manoj Kumar Missing methods in linalg.distributed:

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9533) Add missing methods in Word2Vec ML (Python API)
[ https://issues.apache.org/jira/browse/SPARK-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-9533: --- Component/s: PySpark Add missing methods in Word2Vec ML (Python API) --- Key: SPARK-9533 URL: https://issues.apache.org/jira/browse/SPARK-9533 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Manoj Kumar Priority: Minor After SPARK-8874 is resolved, we can add Python wrappers for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9484) Word2Vec import/export for original binary format
[ https://issues.apache.org/jira/browse/SPARK-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651899#comment-14651899 ] Manoj Kumar commented on SPARK-9484: I just went through the C code that does the .bin reading. What would be the best way to go about this? The codepaths should be almost completely different depending on whether path.endsWith(".bin"), right? Also, should this use the SaveLoadV1_0 object, or should we have a different object (say, SaveLoadBinary) that would keep the codepaths independent and make maintenance easier? Word2Vec import/export for original binary format - Key: SPARK-9484 URL: https://issues.apache.org/jira/browse/SPARK-9484 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor It would be nice to add model import/export for Word2Vec which handles the original binary format used by [https://code.google.com/p/word2vec/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
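For reference, the original binary format is a text header "<vocab_size> <vector_size>" followed, for each word, by the token bytes up to a space and then vector_size little-endian float32 values. A hedged, standalone Python sketch of the read path (not Spark's loader):
{code}
import struct

def read_word2vec_bin(path):
    """Parse the word2vec .bin format into a {word: vector} dict."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            word = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" ":
                    break
                if ch != b"\n":  # some writers emit a newline between records
                    word.extend(ch)
            vec = struct.unpack("<%df" % dim, f.read(4 * dim))
            vectors[word.decode("utf-8")] = vec
    return vectors
{code}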
[jira] [Commented] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650619#comment-14650619 ] Manoj Kumar commented on SPARK-8874: Done. Thanks. Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9533) Add missing methods in Word2Vec ML (Python API)
Manoj Kumar created SPARK-9533: -- Summary: Add missing methods in Word2Vec ML (Python API) Key: SPARK-9533 URL: https://issues.apache.org/jira/browse/SPARK-9533 Project: Spark Issue Type: Improvement Reporter: Manoj Kumar After SPARK-8874 is resolved, we can add Python wrappers for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650314#comment-14650314 ] Manoj Kumar commented on SPARK-6227: [~mengxr] Can this be assigned to me, since the BlockMatrix PR is already being worked on? PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9525) Optimize SparseVector initializations in linalg
[ https://issues.apache.org/jira/browse/SPARK-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-9525: --- Priority: Major (was: Minor) Optimize SparseVector initializations in linalg --- Key: SPARK-9525 URL: https://issues.apache.org/jira/browse/SPARK-9525 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar
1. Remove the sorting of indices and assume that the user gives a sorted tuple of indices, values, etc.
2. Avoid iterating twice to get the indices and values if the argument provided is a dict.
3. Add checks such that the length of the indices is less than the size provided.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9525) Optimize SparseVector initializations in linalg
Manoj Kumar created SPARK-9525: -- Summary: Optimize SparseVector initializations in linalg Key: SPARK-9525 URL: https://issues.apache.org/jira/browse/SPARK-9525 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor
1. Remove the sorting of indices and assume that the user gives a sorted tuple of indices, values, etc.
2. Avoid iterating twice to get the indices and values if the argument provided is a dict.
3. Add checks such that the length of the indices is less than the size provided.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
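A hedged sketch of point 2 (illustrative NumPy, not the pyspark.mllib.linalg code): build both arrays in a single pass over the dict's sorted items instead of walking it once for indices and once for values.
{code}
import numpy as np

def arrays_from_dict(d):
    # One walk over the sorted (index, value) pairs yields both columns.
    pairs = np.array(sorted(d.items()), dtype=np.float64)
    indices = pairs[:, 0].astype(np.int32)
    values = pairs[:, 1]
    return indices, values

indices, values = arrays_from_dict({3: 5.5, 1: 1.0})
print(indices, values)  # [1 3] [1.  5.5]
{code}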
[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647380#comment-14647380 ] Manoj Kumar commented on SPARK-9277: I will not have access to a development environment till Saturday. Feel free to fix it. Thanks. SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case:
In [2]: sc.version
Out[2]: u'1.3.1'
In [13]: from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]: l = LabeledPoint(0, x)
In [12]: r = sc.parallelize([l])
In [14]: m = LogisticRegressionWithSGD.train(r)
Error:
Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
Attached is the notebook with the scenario and the full message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
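A minimal sketch of the check the report asks for, in plain Python (the real fix belongs in the SparseVector constructor): fail fast when the indices cannot fit in the declared size, instead of surfacing a Java ArrayIndexOutOfBoundsException much later.
{code}
def validate_sparse_args(size, indices):
    # indices are assumed sorted ascending, as SparseVector requires.
    if len(indices) > size:
        raise ValueError("indices array length %d exceeds declared size %d"
                         % (len(indices), size))
    if indices and indices[-1] >= size:
        raise ValueError("index %d is out of bounds for declared size %d"
                         % (indices[-1], size))

validate_sparse_args(2, [1, 2, 3, 4, 5])  # raises ValueError immediately
{code}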
[jira] [Created] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg
Manoj Kumar created SPARK-9408: -- Summary: Refactor mllib/linalg.py to mllib/linalg Key: SPARK-9408 URL: https://issues.apache.org/jira/browse/SPARK-9408 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Manoj Kumar We need to refactor mllib/linalg.py to mllib/linalg so that the project structure is similar to that of Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640055#comment-14640055 ] Manoj Kumar commented on SPARK-9277: I have labelled this as started. Will fix this in 4 days, if no one comes forward by then. SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case:
In [2]: sc.version
Out[2]: u'1.3.1'
In [13]: from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]: l = LabeledPoint(0, x)
In [12]: r = sc.parallelize([l])
In [14]: m = LogisticRegressionWithSGD.train(r)
Error:
Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
Attached is the notebook with the scenario and the full message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-9277: --- Labels: starter (was: ) SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case:
In [2]: sc.version
Out[2]: u'1.3.1'
In [13]: from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]: l = LabeledPoint(0, x)
In [12]: r = sc.parallelize([l])
In [14]: m = LogisticRegressionWithSGD.train(r)
Error:
Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
Attached is the notebook with the scenario and the full message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7105) Support model save/load in Python's GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635424#comment-14635424 ] Manoj Kumar commented on SPARK-7105: Hi, Are you still working on this? Support model save/load in Python's GaussianMixture --- Key: SPARK-7105 URL: https://issues.apache.org/jira/browse/SPARK-7105 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yu Ishikawa Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9223) Support model save/load in Python's LDA
Manoj Kumar created SPARK-9223: -- Summary: Support model save/load in Python's LDA Key: SPARK-9223 URL: https://issues.apache.org/jira/browse/SPARK-9223 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering
Manoj Kumar created SPARK-9222: -- Summary: Make class instantiation variables in DistributedLDAModel [private] clustering Key: SPARK-9222 URL: https://issues.apache.org/jira/browse/SPARK-9222 Project: Spark Issue Type: Test Components: MLlib Reporter: Manoj Kumar Priority: Minor This would enable testing the various class variables like docConcentration, topicConcentration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6486) Add BlockMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631182#comment-14631182 ] Manoj Kumar commented on SPARK-6486: Great, I will start on this after the weekend. Add BlockMatrix in PySpark -- Key: SPARK-6486 URL: https://issues.apache.org/jira/browse/SPARK-6486 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA should contain conversions between IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover linear algebra operations of block matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630053#comment-14630053 ] Manoj Kumar commented on SPARK-9112: Yes, that is the idea. Also, we need not port it to ML right now; we could convert the transformed DataFrame to the required input type in mllib. It might also be useful to return the probability of the predicted class (as done by predict_proba in scikit-learn). How does that sound? Implement LogisticRegressionSummary similar to LinearRegressionSummary -- Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Priority: Minor Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
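A hedged sketch of the conversion mentioned in the comment: map the transformed DataFrame to the (score, label) pairs the existing mllib evaluator consumes. Column names are the ML defaults, and `predictions` is assumed to be the output of model.transform(dataset).
{code}
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# predictions: DataFrame with "probability" and "label" columns (assumed).
score_and_labels = predictions.select("probability", "label").rdd.map(
    lambda row: (float(row.probability[1]), float(row.label)))
metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)
{code}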
[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters
[ https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629808#comment-14629808 ] Manoj Kumar commented on SPARK-6001: I just started to work on this. K-Means clusterer should return the assignments of input points to clusters --- Key: SPARK-6001 URL: https://issues.apache.org/jira/browse/SPARK-6001 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: Derrick Burns Priority: Minor The K-Means clusterer returns a KMeansModel that contains the cluster centers. However, when available, I suggest that the K-Means clusterer also return an RDD of the assignments of the input data to the clusters. While the assignments can be computed given the KMeansModel, why not return the assignments if they are available, to save re-computation costs? The K-means implementation at https://github.com/derrickburns/generalized-kmeans-clustering returns the assignments when available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630014#comment-14630014 ] Manoj Kumar commented on SPARK-9112: Indeed, it seems so. But the merged LinearRegressionSummary also has just RegressionMetrics (unless there is anything that I missed). By the way, I'm not sure how to set the priority labels; sorry if it is wrong (the umbrella JIRA has a critical priority, so I thought that this might qualify as major). Implement LogisticRegressionSummary similar to LinearRegressionSummary -- Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Priority: Minor Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters
[ https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629981#comment-14629981 ] Manoj Kumar commented on SPARK-6001: Oops. I just figured out we do not have a KMeans yet in spark.ml K-Means clusterer should return the assignments of input points to clusters --- Key: SPARK-6001 URL: https://issues.apache.org/jira/browse/SPARK-6001 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: Derrick Burns Priority: Minor The K-Means clusterer returns a KMeansModel that contains the cluster centers. However, when available, I suggest that the K-Means clusterer also return an RDD of the assignments of the input data to the clusters. While the assignments can be computed given the KMeansModel, why not return assignments if they are available to save re-computation costs. The K-means implementation at https://github.com/derrickburns/generalized-kmeans-clustering returns the assignments when available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
Manoj Kumar created SPARK-9112: -- Summary: Implement LogisticRegressionSummary similar to LinearRegressionSummary Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630132#comment-14630132 ] Manoj Kumar commented on SPARK-9112: I see that these are fields in the transformed DataFrame already (the raw predictions, probability, etc). I think just wrapping up the metrics should suffice. I'll send a PR in a while. Implement LogisticRegressionSummary similar to LinearRegressionSummary -- Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Priority: Minor Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8996) Add Python API for Kolmogorov-Smirnov Test
[ https://issues.apache.org/jira/browse/SPARK-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626409#comment-14626409 ] Manoj Kumar commented on SPARK-8996: Hi, Can I work on this? Add Python API for Kolmogorov-Smirnov Test -- Key: SPARK-8996 URL: https://issues.apache.org/jira/browse/SPARK-8996 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Add Python API for the Kolmogorov-Smirnov test implemented in SPARK-8598. It should be similar to ChiSqTest in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
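A hedged sketch of what the Python call could look like, assuming it mirrors ChiSqTest's placement under pyspark.mllib.stat.Statistics and the one-sample, two-sided Scala API from SPARK-8598 (an active SparkContext `sc` is assumed):
{code}
from pyspark.mllib.stat import Statistics

data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
# Test the sample against a standard normal distribution.
result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
print(result.statistic, result.pValue)
{code}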
[jira] [Commented] (SPARK-3703) Ensemble learning methods
[ https://issues.apache.org/jira/browse/SPARK-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625859#comment-14625859 ] Manoj Kumar commented on SPARK-3703: Hi, I am interested in working on ensemble methods in general (as seen from my initial few pull requests). Are any of these targeted towards the 1.5 release? I'm asking because I might not be able to commit enough time after September. Ensemble learning methods - Key: SPARK-3703 URL: https://issues.apache.org/jira/browse/SPARK-3703 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley This is a general JIRA for coordinating on adding ensemble learning methods to MLlib. These methods include a variety of boosting and bagging algorithms. Below is a general design doc for ensemble methods (currently focused on boosting). Please comment here about general discussion and coordination; for comments about specific algorithms, please comment on their respective JIRAs. [Design doc for ensemble methods | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed
[ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289 ] Manoj Kumar edited comment on SPARK-7126 at 7/13/15 8:36 PM: - [~josephkb] 1. In scikit-learn, predict outputs the same labels as the inputs. (Internally we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 1, .. n_labels - 1]; the numerically smallest gets zero.) This is in contrast to StringIndexer, which gives the most frequent label the smallest index. 2. I'm not sure it is necessary to show the users what is being done internally. Should it not be sufficient to just give them the predicted output in terms of the input labels? (I'm highly biased based on my previous experience in sklearn ;) ) Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not quite sure if there is a generic way of doing this across all classifiers) was (Author: mechcoder): [~josephkb] 1. In scikit-learn predict outputs the same labels as the inputs. (Internally we use sklearn.preprocessing.LabelEncoder) to encode the input labels into [0, 1, .. n_labels - 1] in contrast to StringIndexer which gives the most frequent label the smallest. 2. I'm not sure it is necessary to show the users, what is being done internally. Should it not be sufficient to just give them the predicted output in terms of the input labels (I'm highly biased based on my previous experience in sklearn ;) ) Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not quite sure if there is a generic way of doing this across all classifiers) For spark.ml Classifiers, automatically index labels if they are not yet indexed Key: SPARK-7126 URL: https://issues.apache.org/jira/browse/SPARK-7126 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed. This would require a bit of design: * Should predict() output the original labels or the indices? * How should we notify users that the labels are being automatically indexed? * How should we provide that index to the users? * If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
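A quick illustration of the difference described in point 1:
{code}
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(["b", "a", "b", "b"]))  # [1 0 1 1]: "a" sorts first, so "a" -> 0
# StringIndexer on the same column would give "b" -> 0.0, as it is most frequent.
{code}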
[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625241#comment-14625241 ] Manoj Kumar commented on SPARK-6261: We can mark this as resolved, I think? Python MLlib API missing items: Feature --- Key: SPARK-6261 URL: https://issues.apache.org/jira/browse/SPARK-6261 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.
StandardScalerModel
* All functionality except predict() is missing.
IDFModel
* idf
Word2Vec
* setMinCount
Word2VecModel
* getVectors
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed
[ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289 ] Manoj Kumar commented on SPARK-7126: [~josephkb] 1. In scikit-learn, predict outputs the same labels as the inputs. (Internally we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 1, .. n_labels - 1].) This is in contrast to StringIndexer, which gives the most frequent label the smallest index. 2. I'm not sure it is necessary to show the users what is being done internally. Should it not be sufficient to just give them the predicted output in terms of the input labels? (I'm highly biased based on my previous experience in sklearn ;) ) Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not quite sure if there is a generic way of doing this across all classifiers) For spark.ml Classifiers, automatically index labels if they are not yet indexed Key: SPARK-7126 URL: https://issues.apache.org/jira/browse/SPARK-7126 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed. This would require a bit of design: * Should predict() output the original labels or the indices? * How should we notify users that the labels are being automatically indexed? * How should we provide that index to the users? * If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8704: --- Summary: Add missing methods in StandardScaler (ML and PySpark) (was: Add missing methods in StandardScaler) Add missing methods in StandardScaler (ML and PySpark) -- Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar
std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8704: --- Summary: Add missing methods in StandardScaler (was: Add additional methods to wrappers in ml.pyspark.feature) Add missing methods in StandardScaler - Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar
std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8874: --- Component/s: PySpark ML Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8874) Add missing methods in Word2Vec ML
Manoj Kumar created SPARK-8874: -- Summary: Add missing methods in Word2Vec ML Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Reporter: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8823) Optimizations for sparse vector products in pyspark.mllib.linalg
Manoj Kumar created SPARK-8823: -- Summary: Optimizations for sparse vector products in pyspark.mllib.linalg Key: SPARK-8823 URL: https://issues.apache.org/jira/browse/SPARK-8823 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Currently we iterate over the indices and values of both sparse vectors in a Python loop; this can be vectorized in NumPy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
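A hedged sketch of the kind of vectorization meant here (standalone NumPy, not the actual pyspark.mllib.linalg code): replace the element-by-element Python loop of a sparse-sparse dot product with array operations.
{code}
import numpy as np

def sparse_dot(ind_a, val_a, ind_b, val_b):
    # Keep only the positions present in both index arrays (assumed sorted).
    common, ia, ib = np.intersect1d(ind_a, ind_b, return_indices=True)
    return float(np.dot(val_a[ia], val_b[ib]))

a_ind, a_val = np.array([0, 2, 4]), np.array([1.0, 2.0, 3.0])
b_ind, b_val = np.array([2, 3, 4]), np.array([5.0, 7.0, 11.0])
print(sparse_dot(a_ind, a_val, b_ind, b_val))  # 2*5 + 3*11 = 43.0
{code}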
[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark
[ https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611611#comment-14611611 ] Manoj Kumar commented on SPARK-8706: Sorry for sounding dumb, but the present code downloads pep8 as a single script, whereas pylint is a whole repository, which in turn has two dependencies. What is the preferred way to do this in Spark? Implement Pylint / Prospector checks for PySpark Key: SPARK-8706 URL: https://issues.apache.org/jira/browse/SPARK-8706 Project: Spark Issue Type: New Feature Components: Project Infra, PySpark Reporter: Josh Rosen It would be nice to implement Pylint / Prospector (https://github.com/landscapeio/prospector) checks for PySpark. As with the style checker rules, I'd imagine that we'll want to roll out new rules gradually in order to avoid a mass refactoring commit. For starters, we should create a pull request that introduces the harness for running the linters, add a configuration file which enables only the lint checks that currently pass, and install the required dependencies on Jenkins. Once we've done this, we can open a series of smaller followup PRs to gradually enable more linting checks and to fix existing violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors
[ https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7401: --- Priority: Major (was: Minor) Dot product and squared_distances should be vectorized in Vectors - Key: SPARK-7401 URL: https://issues.apache.org/jira/browse/SPARK-7401 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark
[ https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar closed SPARK-8291. -- Resolution: Won't Fix Add parse functionality to LabeledPoint in PySpark -- Key: SPARK-8291 URL: https://issues.apache.org/jira/browse/SPARK-8291 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor It is useful to have functionality that can parse a string into a LabeledPoint while loading files, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils
[ https://issues.apache.org/jira/browse/SPARK-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar resolved SPARK-8265. Resolution: Fixed Fix Version/s: 1.5.0 Add LinearDataGenerator to pyspark.mllib.utils -- Key: SPARK-8265 URL: https://issues.apache.org/jira/browse/SPARK-8265 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor Fix For: 1.5.0 This is useful in testing various linear models in pyspark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3258) Python API for streaming MLlib algorithms
[ https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608757#comment-14608757 ] Manoj Kumar commented on SPARK-3258: [~mengxr] We can mark this as resolved. Python API for streaming MLlib algorithms - Key: SPARK-3258 URL: https://issues.apache.org/jira/browse/SPARK-3258 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark, Streaming Reporter: Xiangrui Meng This is an umbrella JIRA to track Python port of streaming MLlib algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3258) Python API for streaming MLlib algorithms
[ https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar resolved SPARK-3258. Resolution: Fixed Fix Version/s: 1.5.0 Python API for streaming MLlib algorithms - Key: SPARK-3258 URL: https://issues.apache.org/jira/browse/SPARK-3258 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark, Streaming Reporter: Xiangrui Meng Fix For: 1.5.0 This is an umbrella JIRA to track Python port of streaming MLlib algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature
Manoj Kumar created SPARK-8704: -- Summary: Add additional methods to wrappers in ml.pyspark.feature Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar
std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark
[ https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606311#comment-14606311 ] Manoj Kumar commented on SPARK-8706: Mind if I hack on this? Implement Pylint / Prospector checks for PySpark Key: SPARK-8706 URL: https://issues.apache.org/jira/browse/SPARK-8706 Project: Spark Issue Type: New Feature Components: Project Infra, PySpark Reporter: Josh Rosen It would be nice to implement Pylint / Prospector (https://github.com/landscapeio/prospector) checks for PySpark. As with the style checker rules, I'll imagine that we'll want to roll out new rules gradually in order to avoid a mass refactoring commit. For starters, we should create a pull request that introduces the harness for running the linters, add a configuration file which enables only the lint checks that currently pass, and install the required dependencies on Jenkins. Once we've done this, we can open a series of smaller followup PRs to gradually enable more linting checks and to fix existing violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8633) List missing model methods in Python Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606266#comment-14606266 ] Manoj Kumar commented on SPARK-8633: I think that should be it. List missing model methods in Python Pipeline API - Key: SPARK-8633 URL: https://issues.apache.org/jira/browse/SPARK-8633 Project: Spark Issue Type: Task Components: ML, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Most Python models under the pipeline API are implemented as JavaModel wrappers. However, we didn't provide methods to extract information from model. In SPARK-7647, we added weights and intercept to linear models. This JIRA is to list all missing model methods, create JIRAs for each, and link them here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8711) Add additional methods to JavaModel wrappers in trees
Manoj Kumar created SPARK-8711: -- Summary: Add additional methods to JavaModel wrappers in trees Key: SPARK-8711 URL: https://issues.apache.org/jira/browse/SPARK-8711 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8678) Default values in Pipeline API should be immutable
Manoj Kumar created SPARK-8678: -- Summary: Default values in Pipeline API should be immutable Key: SPARK-8678 URL: https://issues.apache.org/jira/browse/SPARK-8678 Project: Spark Issue Type: Bug Components: ML, PySpark Reporter: Manoj Kumar If the default params are mutable and the function or method is called again without a value for them, the previously mutated values are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
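The classic Python illustration of this bug class (a generic example, not Spark code): a mutable default is created once at function definition time and then shared across all calls.
{code}
def add_item(item, items=[]):  # one shared list, created at definition time
    items.append(item)
    return items

print(add_item(1))  # [1]
print(add_item(2))  # [1, 2]  <- state from the first call leaks in
{code}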
[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753 ] Manoj Kumar edited comment on SPARK-6724 at 6/24/15 5:13 PM: - [~hrishikesh91] Are you actively working on this? Let us know if you need help. was (Author: mechcoder): [~hrishikesh] Are you actively working on this? Let us know if you need help. Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6791) Model export/import for spark.ml: meta-algorithms
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599777#comment-14599777 ] Manoj Kumar commented on SPARK-6791: Oh sorry, I read "block on" as "block". Do you have any good JIRAs in mind to start with for the Pipeline API (both the Scala and the Python APIs)? There are a number of issues, but I'm not sure which one to start with. Model export/import for spark.ml: meta-algorithms - Key: SPARK-6791 URL: https://issues.apache.org/jira/browse/SPARK-6791 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Algorithms: Pipeline, CrossValidator (and associated models) This task will block on all other subtasks for [SPARK-6725]. This task will also include adding export/import as a required part of the PipelineStage interface since meta-algorithms will depend on sub-algorithms supporting save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
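As a rough sketch of the user-facing surface this proposal implies (hypothetical at the time of the JIRA; `model` and the path are assumed, and the names follow the pattern of the later pyspark.ml persistence API):
{code}
from pyspark.ml import PipelineModel

# `model` is an assumed fitted PipelineModel; the path is illustrative
model.save("/tmp/lr_pipeline_model")

# Restoring delegates to each stage's own load, which is why export/import
# has to be a required part of the PipelineStage interface
restored = PipelineModel.load("/tmp/lr_pipeline_model")
{code}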
[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753 ] Manoj Kumar edited comment on SPARK-6724 at 6/24/15 5:12 PM: - [~hrishikesh] Are you actively working on this? Let us know if you need help. was (Author: mechcoder): [~hrishikesh] Are you actively working on this? Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753 ] Manoj Kumar commented on SPARK-6724: [~hrishikesh] Are you actively working on this? Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5694) Python API for evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-5694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar resolved SPARK-5694. Resolution: Fixed Python API for evaluation metrics - Key: SPARK-5694 URL: https://issues.apache.org/jira/browse/SPARK-5694 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is an umbrella JIRA for evaluation metrics in Python. They should be defined under `pyspark.mllib.evaluation`. We should try wrapping Scala's implementations instead of implementing them in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
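A short usage sketch of the resulting wrappers (assumes an existing SparkContext `sc`; BinaryClassificationMetrics delegates to the Scala implementation, per the approach above):
{code}
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# (score, label) pairs; `sc` is an assumed existing SparkContext
score_and_labels = sc.parallelize(
    [(0.1, 0.0), (0.4, 0.0), (0.6, 1.0), (0.8, 1.0)])

metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)
print(metrics.areaUnderPR)
{code}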
[jira] [Commented] (SPARK-6791) Model export/import for spark.ml: meta-algorithms
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594802#comment-14594802 ] Manoj Kumar commented on SPARK-6791: [~josephkb] I would like to work on this. Which model do we need to start with? Model export/import for spark.ml: meta-algorithms - Key: SPARK-6791 URL: https://issues.apache.org/jira/browse/SPARK-6791 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Algorithms: Pipeline, CrossValidator (and associated models) This task will block on all other subtasks for [SPARK-6725]. This task will also include adding export/import as a required part of the PipelineStage interface since meta-algorithms will depend on sub-algorithms supporting save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8479) Add numNonzeros and numActives to linalg.Matrices
Manoj Kumar created SPARK-8479: -- Summary: Add numNonzeros and numActives to linalg.Matrices Key: SPARK-8479 URL: https://issues.apache.org/jira/browse/SPARK-8479 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Manoj Kumar Priority: Minor Add numNonzeros to count the number of non-zero values and numActives to report the number of values explicitly stored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
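To illustrate the distinction (a plain-Python sketch of the semantics, not the MLlib API): numActives counts explicitly stored entries, while numNonzeros counts only those stored entries that are actually non-zero:
{code}
# Sparse storage can explicitly hold zeros, e.g. after an in-place update
values = [1.0, 0.0, 3.0]  # explicitly stored entries of a sparse matrix

num_actives = len(values)                        # 3: everything stored
num_nonzeros = sum(1 for v in values if v != 0)  # 2: stored and non-zero

print(num_actives, num_nonzeros)
{code}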