[jira] [Created] (SPARK-17146) Add RandomizedSearch to the CrossValidator API

2016-08-18 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-17146:
---

 Summary: Add RandomizedSearch to the CrossValidator API
 Key: SPARK-17146
 URL: https://issues.apache.org/jira/browse/SPARK-17146
 Project: Spark
  Issue Type: Improvement
Reporter: Manoj Kumar


Hi, I would like to add randomized-search support for the CrossValidator API. 
It should be quite straightforward to add with the present abstractions.

Here is the proposed API:
(Names are up for debate)

Proposed Classes:
"ParamSamplerBuilder" or a "ParamRandomizedBuilder" that returns an
Array of ParamMaps

Proposed Methods:
"addBounds"
"addSampler"
"setNumIter"

Code example:
{code}
import scala.util.Random

// Draws regParam log-uniformly from 1e-5 to 1e5.
def sampler(): Double = {
  Math.pow(10.0, -5 + Random.nextFloat * (5 - (-5)))
}

val paramGrid = new ParamRandomizedBuilder()
  .addSampler(lr.regParam, sampler)
  .addBounds(lr.elasticNetParam, 0.0, 1.0)
  .setNumIter(10)
  .build()
{code}
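
For comparison, this mirrors the existing exhaustive counterpart, ParamGridBuilder. A minimal PySpark usage sketch (lr is assumed to be a LogisticRegression, as above):

{code}
from pyspark.ml.tuning import ParamGridBuilder

# Exhaustive grid over fixed candidate values, as opposed to sampling:
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [1e-5, 1e-3, 1e-1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
{code}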

Let me know your thoughts!






[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-08-17 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425522#comment-15425522
 ] 

Manoj Kumar commented on SPARK-17116:
-

Haha, not really. I just found it odd that setParams accepts the parameter as a 
string, while params at fit time must be instances of Param.
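
For concreteness, a minimal sketch of the lookup this would need, relying on the existing Params.getParam method in PySpark (the helper itself is hypothetical, not the proposed patch):

{code}
def _to_param_map(estimator, extra):
    """Accept a {Param: value} or {"name": value} dict; return {Param: value}."""
    resolved = {}
    for key, value in extra.items():
        # getParam raises an error if no param has the given string name.
        param = estimator.getParam(key) if isinstance(key, str) else key
        resolved[param] = value
    return resolved

# With this inside fit, both calls below would behave identically:
#     lr.fit(dataset, {lr.maxIter: 20})
#     lr.fit(dataset, {"maxIter": 20})
{code}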

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}






[jira] [Created] (SPARK-17118) Make examples Python3 compatible

2016-08-17 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-17118:
---

 Summary: Make examples Python3 compatible
 Key: SPARK-17118
 URL: https://issues.apache.org/jira/browse/SPARK-17118
 Project: Spark
  Issue Type: Improvement
Reporter: Manoj Kumar


There are various examples that do not work in Python 3. Most of the required 
fixes just involve modifying the print statements.

(examples/src/main/python/ml/estimator_transformer_param_example.py) is one 
such example.
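
For illustration, the typical change looks like this (representative lines, not quoted from that file):

{code}
from __future__ import print_function  # no-op on Python 3, needed on Python 2

# Python 2 only -- a SyntaxError under Python 3:
#     print "LogisticRegression parameters:"

# Portable replacement:
print("LogisticRegression parameters:")
{code}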






[jira] [Updated] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-08-17 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-17116:

Description: 
Currently, it is possible to override the default params set at constructor 
time by supplying a ParamMap which is essentially a (Param: value) dict.
Looking at the codebase, it should be trivial to extend this to a (string, 
value) representation.

{code}
# This hints that the maxiter param of the lr instance is modified in-place
lr = LogisticRegression(maxIter=10, regParam=0.01)
lr.fit(dataset, {lr.maxIter: 20})

# This seems more natural.
lr.fit(dataset, {"maxIter": 20})
{code}

  was:
Currently, it is possible to override the default params set at constructor 
time by supplying a ParamMap which is essentially a (Param: value) dict.
Looking at the codebase, it should be trivial to extend this to a (string, 
value) representation.

{code}
# This hints that the maxiter param of the lr instance is modified in-place
lr = LogisticRegression(maxIter=10, regParam=0.01)
lr.fit(dataset, {lr.maxIter: 20})

# This seems more natural.
lr.fit(dataset, {"maxiter": 20})
{code}


> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}






[jira] [Comment Edited] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-08-17 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425494#comment-15425494
 ] 

Manoj Kumar edited comment on SPARK-17116 at 8/17/16 10:17 PM:
---

[~josephkb] [~mlnick] [~holdenk]
This is not super important, but I do think it will be helpful.


was (Author: mechcoder):
[~josephkb] [~mlnick]
This is not super important, but I do think it will be helpful.

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxiter": 20})
> {code}






[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-08-17 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425494#comment-15425494
 ] 

Manoj Kumar commented on SPARK-17116:
-

[~josephkb] [~mlnick]
This is not super important, but I do think it will be helpful.

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxiter": 20})
> {code}






[jira] [Updated] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-08-17 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-17116:

Summary: Allow params to be a {string, value} dict at fit time  (was: Allow 
params to be a {string, value} dict)

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxiter": 20})
> {code}






[jira] [Created] (SPARK-17116) Allow params to be a {string, value} dict

2016-08-17 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-17116:
---

 Summary: Allow params to be a {string, value} dict
 Key: SPARK-17116
 URL: https://issues.apache.org/jira/browse/SPARK-17116
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Manoj Kumar
Priority: Minor


Currently, it is possible to override the default params set at constructor 
time by supplying a ParamMap which is essentially a (Param: value) dict.
Looking at the codebase, it should be trivial to extend this to a (string, 
value) representation.

{code}
# This hints that the maxiter param of the lr instance is modified in-place
lr = LogisticRegression(maxIter=10, regParam=0.01)
lr.fit(dataset, {lr.maxIter: 20})

# This seems more natural.
lr.fit(dataset, {"maxiter": 20})
{code}






[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-08 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368602#comment-15368602
 ] 

Manoj Kumar commented on SPARK-16365:
-

Could you be a bit clearer about the first point? Is it so that people can 
quickly prototype locally with a small subsample of the data before doing the 
DataFrame/RDD conversion to handle huge amounts of data?

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)






[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-08 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368216#comment-15368216
 ] 

Manoj Kumar commented on SPARK-3728:


Hi [~xusen]. Are you still working on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.
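
A small sketch of the queue-discipline difference described above, using toy dict-based nodes rather than Spark's internal Node type:

{code}
from collections import deque

def split_order(roots, lifo):
    """Order in which tree nodes would be split.

    FIFO (lifo=False) grows all trees breadth-first, interleaved; LIFO
    (lifo=True) finishes one branch -- and hence one tree -- before moving
    on, so a completed tree could be written to disk early.
    """
    nodes = deque(roots)
    order = []
    while nodes:
        node = nodes.pop() if lifo else nodes.popleft()
        order.append(node["id"])
        nodes.extend(node.get("children", []))
    return order

tree = {"id": "t1", "children": [{"id": "t1.left"}, {"id": "t1.right"}]}
print(split_order([tree], lifo=True))  # ['t1', 't1.right', 't1.left']
{code}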






[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366978#comment-15366978
 ] 

Manoj Kumar commented on SPARK-16365:
-

Is the ultimate aim to make mllib-local the scikit-learn of Scala?

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)






[jira] [Commented] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366392#comment-15366392
 ] 

Manoj Kumar commented on SPARK-16399:
-

It would just run with the default Python, which in this case is Python 2.6.

> Set PYSPARK_PYTHON to point to "python" instead of "python2.7"
> --
>
> Key: SPARK-16399
> URL: https://issues.apache.org/jira/browse/SPARK-16399
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Assignee: Manoj Kumar
>Priority: Minor
> Fix For: 2.1.0
>
>
> Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even 
> though higher versions of Python seem to be installed.
> It would be better to point "PYSPARK_PYTHON" to "python" instead.






[jira] [Created] (SPARK-16399) Set PYSPARK_PYTHON to point to "python" instead of "python2.7"

2016-07-06 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16399:
---

 Summary: Set PYSPARK_PYTHON to point to "python" instead of 
"python2.7"
 Key: SPARK-16399
 URL: https://issues.apache.org/jira/browse/SPARK-16399
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Manoj Kumar
Priority: Minor


Right now, ./bin/pyspark forces "PYSPARK_PYTHON" to be "python2.7" even though 
higher versions of Python seem to be installed.
It would be better to point "PYSPARK_PYTHON" to "python" instead.






[jira] [Created] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16307:
---

 Summary: Improve testing for DecisionTree variances
 Key: SPARK-16307
 URL: https://issues.apache.org/jira/browse/SPARK-16307
 Project: Spark
  Issue Type: Test
Reporter: Manoj Kumar
Priority: Minor


The current test assumes that Impurity.calculate() returns the variance 
correctly. A better approach would be to test that the returned variance equals 
a variance we can verify by hand on toy data and a toy tree.
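
For instance, a toy check of the kind proposed could look like this (illustrative only; the real test would read the impurity statistics off a trained tree's leaf):

{code}
import numpy as np

# Labels that all land in one leaf; the leaf's reported variance should
# equal the population variance of these labels, computed by hand.
labels = np.array([1.0, 2.0, 3.0, 4.0])
expected_variance = np.mean((labels - labels.mean()) ** 2)  # 1.25

# The test would then assert that the leaf's impurity equals
# expected_variance, instead of trusting Impurity.calculate().
{code}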






[jira] [Created] (SPARK-16306) Improve testing for DecisionTree variances

2016-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16306:
---

 Summary: Improve testing for DecisionTree variances
 Key: SPARK-16306
 URL: https://issues.apache.org/jira/browse/SPARK-16306
 Project: Spark
  Issue Type: Test
Reporter: Manoj Kumar
Priority: Minor


The current test assumes that Impurity.calculate() returns the variance 
correctly. A better approach would be to test that the returned variance equals 
a variance we can verify by hand on toy data and a toy tree.






[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-23 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335165#comment-15335165
 ] 

Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM:
---

OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact grows. I'll see what I can 
optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
|1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%



was (Author: mechcoder):
OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact grows. I'll see what I can 
optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%


> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-23 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335165#comment-15335165
 ] 

Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM:
---

OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact grows. I'll see what I can 
optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent of total time 
spent in impurityCalculator || percent of total time spent in impurityStats ||
|1 |  1 |  500 | 7.90 | 0.328% | 0.01%
|10 |  1 |  500 | 7.67 | 1.3% | 0.12%
|100 |  1 |  500 | 18.156 | 5.19% | 0.29%
1 |  500 |  1 | 7.1308 | 0.39% | 0.014%
|10 |  500 |  1 | 7.5506 | 1.37% | 0.12%
|100 |  500 |  1 | 17.61| 6.18% | 0.349%
|1 |  1000 |  1000 | 6.99 | 0.28% | 0.029%
|10 |  1000 |  1000 | 7.415  | 1.7% | 0.09%
|100 |  1000 |  1000 | 17.89 | 6.1% | 0.3%
|500 |  1000 |  1000 | 71.02 | 6.8% | 0.3%



was (Author: mechcoder):
OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact grows. I'll see what I can 
optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent in 
binsToBestSplit || percent in impurityCalculator || percent in 
impurityStatsTime ||
|1 |  1 |  500 | 2 | 19.5% | 15% | 0.1%
|10 |  1 |  500 | 2.45 | 13% | 8.5%| 0.7%
|100 |  1 |  500 | 4.48 | 64.5% | 41.5% | 2.1%
|500 |  1 |  500 | 15.2 | 89.6% | 61.1% | 3.4%
|1 |  500 |  1 | 2.16 | 18.5% | 16.2% | ~
|10 |  500 |  1 | 2.70 | 14.8% | 11.1%| 0.4%
|100 |  500 |  1 | 9.07 | 43.5% | 31.4% | 1.9%
|1 |  1000 |  1000 | 2.02 | 24.7% | 14.8% | 0.2%
|10 |  1000 |  1000 | 6.2 | 12.8% | 9.6%| 0.1%
|50 |  1000 |  1000 | 4.05 | 38.5% | 28.8% | 2.8%
|100 |  1000 |  1000 | 10.19 | 45.3% | 30.6% | 3.18%


> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Issue Comment Deleted] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-22 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-14351:

Comment: was deleted

(was: Here are my thoughts. Also ccing [~sethah], since he has seen this part of 
the codebase quite a few times, to get some ideas.
Right now, a copy of size *statsSize*, sliced from *allStats*, is created for 
every instantiation of *ImpurityCalculator*. The reason is that there are calls 
to *add* and *subtract* that modify the *ImpurityCalculator* in place. However, 
*add* and *subtract* are called far less often than the class is instantiated 
(roughly once for every 2 times *ImpurityCalculator* is instantiated).

I see two alternatives.

1. Pass the view directly to the *ImpurityCalculator* and make a copy whenever 
*add* or *subtract* is called.
2. Pass *allStats*, *offset*, and *offset + statsSize* to the *ImpurityCalculator* 
and make a copy of *allStats* whenever *add* or *subtract* is called.

Both will involve making *stats* a def, which would return a copy whenever it 
is called. The first one is more favourable because the size of *allStats* is 
huge. WDYT?)

> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-22 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345182#comment-15345182
 ] 

Manoj Kumar commented on SPARK-14351:
-

Here are my thoughts. Also ccing [~sethah], since he has seen this part of the 
codebase quite a few times, to get some ideas.
Right now, a copy of size *statsSize*, sliced from *allStats*, is created for 
every instantiation of *ImpurityCalculator*. The reason is that there are calls 
to *add* and *subtract* that modify the *ImpurityCalculator* in place. However, 
*add* and *subtract* are called far less often than the class is instantiated 
(roughly once for every 2 times *ImpurityCalculator* is instantiated).

I see two alternatives.

1. Pass the view directly to the *ImpurityCalculator* and make a copy whenever 
*add* or *subtract* is called.
2. Pass *allStats*, *offset*, and *offset + statsSize* to the *ImpurityCalculator* 
and make a copy of *allStats* whenever *add* or *subtract* is called.

Both will involve making *stats* a def, which would return a copy whenever it 
is called. The first one is more favourable because the size of *allStats* is 
huge. WDYT?
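
A rough sketch of alternative 1 (copy-on-write over a view), with a NumPy view standing in for the *allStats* slice; the class and field names are hypothetical:

{code}
import numpy as np

class CalculatorSketch:
    """Wraps a view into the shared stats buffer; copies only on mutation."""

    def __init__(self, stats_view):
        self._stats = stats_view  # a view: no copy at instantiation
        self._copied = False

    def _ensure_owned(self):
        if not self._copied:  # copy-on-write: pay for the copy only
            self._stats = self._stats.copy()  # when add/subtract happens
            self._copied = True

    def add(self, other):
        self._ensure_owned()
        self._stats += other._stats
        return self

all_stats = np.zeros(1000)
calc = CalculatorSketch(all_stats[100:110])  # instantiation stays cheap
{code}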

> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335165#comment-15335165
 ] 

Manoj Kumar commented on SPARK-14351:
-

OK, so here are some benchmarks that partially validate your claims (all 
trained to maxDepth=30 with the auto feature selection strategy). The trend is 
that as the number of trees increases, the impact grows. I'll see what I can 
optimize tomorrow.

|| n_tree ||  n_samples || n_features || totalTime ||  percent in 
binsToBestSplit || percent in impurityCalculator || percent in 
impurityStatsTime ||
|1 |  1 |  500 | 2 | 19.5% | 15% | 0.1%
|10 |  1 |  500 | 2.45 | 13% | 8.5%| 0.7%
|100 |  1 |  500 | 4.48 | 64.5% | 41.5% | 2.1%
|500 |  1 |  500 | 15.2 | 89.6% | 61.1% | 3.4%
|1 |  500 |  1 | 2.16 | 18.5% | 16.2% | ~
|10 |  500 |  1 | 2.70 | 14.8% | 11.1%| 0.4%
|100 |  500 |  1 | 9.07 | 43.5% | 31.4% | 1.9%
|1 |  1000 |  1000 | 2.02 | 24.7% | 14.8% | 0.2%
|10 |  1000 |  1000 | 6.2 | 12.8% | 9.6%| 0.1%
|50 |  1000 |  1000 | 4.05 | 38.5% | 28.8% | 2.8%
|100 |  1000 |  1000 | 10.19 | 45.3% | 30.6% | 3.18%


> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328941#comment-15328941
 ] 

Manoj Kumar commented on SPARK-14351:
-

I can try working on this.

> Optimize ImpurityAggregator for decision trees
> --
>
> Key: SPARK-14351
> URL: https://issues.apache.org/jira/browse/SPARK-14351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> {{RandomForest.binsToBestSplit}} currently takes a large amount of time.  
> Based on some quick profiling, I believe a big chunk of this is spent in 
> {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
> copies) and {{RandomForest.calculateImpurityStats}}.
> This JIRA is for:
> * Doing more profiling to confirm that unnecessary time is being spent in 
> some of these methods.
> * Optimizing the implementation
> * Profiling again to confirm the speedups
> Local profiling for large enough examples should suffice, especially since 
> the optimizations should not need to change the amount of data communicated.






[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939
 ] 

Manoj Kumar edited comment on SPARK-3155 at 6/14/16 5:01 AM:
-

1. I agree that the use cases are limited to single trees. You kind of lose 
interpretability if you train the tree to maximum depth; pruning helps recover 
interpretability while also improving generalization performance.
3. It is intuitive to prune the tree during training (i.e., stop training once 
the validation error increases). However, this is very similar to just having a 
stopping criterion such as maximum depth or minimum samples per node (except 
that the criterion depends on validation data), and it is quite uncommon to do 
it. The standard practice (at least according to my lectures) is to train the 
tree to full depth and then remove leaves according to validation data.

However, if you feel that #14351 is more important, I can focus on that.


was (Author: mechcoder):
1. I agree that the use cases are limited to single trees. You kind of lose 
interpretability if you train the tree to maximum depth. It helps in improving 
interpretability while also improving on generalization performance. 
3. It is intuitive to prune the tree during training (i.e stop training after 
the validation error increases) . However this is very similar to just having a 
stopping criterion such as maximum depth, minimum samples in each node (except 
that the stopping criteria is dependent on validation data)
And is quite uncommon to do it. The standard practise (at least according to my 
lectures) is to train the train to full depth and remove the leaves according 
to validation data.

However, if you feel that #14351 is more important, I can focus on that.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.
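
A compact sketch of the naive procedure in steps (1)-(4), on a toy dict-based tree whose per-node validation error has already been computed (not Spark's Node class):

{code}
def prune(node):
    """Bottom-up pruning: a parent replaces its two leaf children when its
    own validation error is no worse than their combined error."""
    if "children" not in node:
        return node
    node["children"] = [prune(child) for child in node["children"]]
    left, right = node["children"]
    if ("children" not in left and "children" not in right
            and node["error"] <= left["error"] + right["error"]):
        del node["children"]  # collapse: the node becomes a leaf
    return node
{code}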






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939
 ] 

Manoj Kumar commented on SPARK-3155:


1. I agree that the use cases are limited to single trees. You kind of lose 
interpretability if you train the tree to maximum depth; pruning helps recover 
interpretability while also improving generalization performance.
3. It is intuitive to prune the tree during training (i.e., stop training once 
the validation error increases). However, this is very similar to just having a 
stopping criterion such as maximum depth or minimum samples per node (except 
that the criterion depends on validation data), and it is quite uncommon to do 
it. The standard practice (at least according to my lectures) is to train the 
tree to full depth and then remove leaves according to validation data.

However, if you feel that #14351 is more important, I can focus on that.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328592#comment-15328592
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel; pruning would stop when the 
improvement in error does not exceed a certain tolerance. Does that sound like a 
good idea?


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-3155:
---
Comment: was deleted

(was: I would like to add support for pruning DecisionTrees as part of my 
internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel; pruning would stop when the 
improvement in error does not exceed a certain tolerance. Does that sound like a 
good idea?
)

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328487#comment-15328487
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel; pruning would stop when the 
improvement in error does not exceed a certain tolerance. Does that sound like a 
good idea?


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319450#comment-15319450
 ] 

Manoj Kumar commented on SPARK-9623:


[~yanboliang] Are you still working on this? Would you mind if I take over?

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.
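
One common way to realize this, sketched under the assumption that the per-tree predictions for an example are available (not necessarily the API that would be exposed):

{code}
import numpy as np

# Predictions for one example from each tree of a 5-tree forest.
tree_predictions = np.array([3.1, 2.9, 3.4, 2.8, 3.0])

prediction = tree_predictions.mean()  # the forest's point prediction
variance = tree_predictions.var()     # spread across trees, a rough
                                      # per-prediction uncertainty estimate
{code}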






[jira] [Created] (SPARK-15761) pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3

2016-06-03 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-15761:
---

 Summary: pyspark shell should load if PYSPARK_DRIVER_PYTHON is 
ipython and Python3
 Key: SPARK-15761
 URL: https://issues.apache.org/jira/browse/SPARK-15761
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Manoj Kumar
Priority: Minor


My default Python is ipython3, and it is odd that the shell fails with "IPython 
requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON".






[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-08-21 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706722#comment-14706722
 ] 

Manoj Kumar commented on SPARK-6192:


[~rxin] It wraps up a few hours from now. I have written a blog post 
summarizing my work during the summer:
https://manojbits.wordpress.com/2015/08/21/google-summer-of-code-wrapup/

I think this can be marked as resolved :)

cc: [~josephkb]

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.






[jira] [Commented] (SPARK-9848) Add @Since annotation to new public APIs in 1.5

2015-08-19 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703133#comment-14703133
 ] 

Manoj Kumar commented on SPARK-9848:


Do we want to tag spark.ml in this release as well?

 Add @Since annotation to new public APIs in 1.5
 ---

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
Priority: Critical
  Labels: starter

 We should get a list of new APIs from SPARK-9660. cc: [~fliang]






[jira] [Commented] (SPARK-9848) Add @since tag to new public APIs in 1.5

2015-08-19 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702808#comment-14702808
 ] 

Manoj Kumar commented on SPARK-9848:


Well, actually, in the linked JIRA I could find no new additions. There is 
only a single change, which marks an auxiliary constructor of a tree as 
private. Can we mark this as resolved?

 Add @since tag to new public APIs in 1.5
 

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
Priority: Critical
  Labels: starter

 We should get a list of new APIs from SPARK-9660. cc: [~fliang]






[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-08-19 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702816#comment-14702816
 ] 

Manoj Kumar commented on SPARK-7751:


Did you forget to add mllib.feature?

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 (Do NOT tag private or package private classes or methods.)
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig through the git history to figure out the Spark version in 
 which a method was first introduced. Take `NaiveBayes.setModelType` as an 
 example. We can grep `def setModelType` at different version git tags.
 {code}
 meng@xm:~/src/spark
 $ git show 
 v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep def setModelType
 meng@xm:~/src/spark
 $ git show 
 v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep def setModelType
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.






[jira] [Created] (SPARK-10108) Add @since tags to mllib.feature

2015-08-19 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-10108:
---

 Summary: Add @since tags to mllib.feature
 Key: SPARK-10108
 URL: https://issues.apache.org/jira/browse/SPARK-10108
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Manoj Kumar
Priority: Minor









[jira] [Updated] (SPARK-10082) Validate i, j in apply (Dense and Sparse Matrices)

2015-08-18 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-10082:

Component/s: MLlib

 Validate i, j in apply (Dense and Sparse Matrices)
 --

 Key: SPARK-10082
 URL: https://issues.apache.org/jira/browse/SPARK-10082
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Manoj Kumar
Priority: Minor

 The given row_ind should be less than the number of rows, and the given 
 col_ind should be less than the number of cols. The current code in master 
 gives unpredictable behavior when they are not.
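 A minimal sketch of the intended check, written here in Python against the 
 PySpark DenseMatrix for illustration (the actual fix targets the Scala 
 `apply` methods; the error type and messages are assumptions):
 {code}
 from pyspark.mllib.linalg import DenseMatrix

 def checked_get(matrix, i, j):
     # Reject out-of-range indices instead of silently reading from the
     # wrong position in the column-major backing array.
     if not 0 <= i < matrix.numRows:
         raise IndexError("row index %d out of range [0, %d)" % (i, matrix.numRows))
     if not 0 <= j < matrix.numCols:
         raise IndexError("column index %d out of range [0, %d)" % (j, matrix.numCols))
     return matrix.toArray()[i, j]

 m = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])
 checked_get(m, 1, 1)   # 4.0; checked_get(m, 2, 0) raises IndexError
 {code}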



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700802#comment-14700802
 ] 

Manoj Kumar commented on SPARK-9911:


Umm. What additional advantage does the MulticlassClassificationEvaluator (or 
the Evaluator abstract class) have over the evaluate methods planned to be 
added to all the transformer models? (e.g., in LogisticRegression and 
LinearRegression as of now)

cc: [~josephkb]

 User guide for MulticlassClassificationEvaluator
 

 Key: SPARK-9911
 URL: https://issues.apache.org/jira/browse/SPARK-9911
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
 not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
 Guides}}) to document this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10082) Validate i, j in apply (Dense and Sparse Matrices)

2015-08-18 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-10082:
---

 Summary: Validate i, j in apply (Dense and Sparse Matrices)
 Key: SPARK-10082
 URL: https://issues.apache.org/jira/browse/SPARK-10082
 Project: Spark
  Issue Type: Bug
Reporter: Manoj Kumar
Priority: Minor


Given row_ind should be less than the number of rows
Given col_ind should be less than the number of cols.

The current code in master gives unpredictable behavior for such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701573#comment-14701573
 ] 

Manoj Kumar commented on SPARK-9911:


Ah, I see. Thanks for the clarification. Where should the user guide for these 
evaluators go? I do not see a user guide for BinaryClassificationEvaluator 
either.

Or should I just add an example in docs/ml-guide.md to tune a problem with a 
MulticlassClassificationEvaluator?
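For reference, a minimal sketch of what such a tuning example could look like 
in PySpark (assuming the Python wrapper for the evaluator is available; the 
grid values and `train` DataFrame are placeholders):
{code}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = MulticlassClassificationEvaluator(metricName="f1")
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
# `train` is assumed to be a DataFrame with "label" and "features" columns.
model = cv.fit(train)
{code}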

 User guide for MulticlassClassificationEvaluator
 

 Key: SPARK-9911
 URL: https://issues.apache.org/jira/browse/SPARK-9911
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
 not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
 Guides}}) to document this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701572#comment-14701572
 ] 

Manoj Kumar commented on SPARK-9911:


Ah, I see. Thanks for the clarification. Where should the user guide for these 
evaluators go? I do not see a user guide for BinaryClassificationEvaluator 
either.

Or should I just add an example in docs/ml-guide.md to tune a problem with a 
MulticlassClassificationEvaluator?

 User guide for MulticlassClassificationEvaluator
 

 Key: SPARK-9911
 URL: https://issues.apache.org/jira/browse/SPARK-9911
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
 not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
 Guides}}) to document this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-18 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-9911:
---
Comment: was deleted

(was: Ah I see. Thanks for the clarification. Where should the user guide for 
these evaluators go? I do not see any user-guide for 
BinaryClassificationEvaluator as well.

Or should I just add an example in docs/ml-guide.md to tune a problem with a 
MulticlassClassificationEvaluator?)

 User guide for MulticlassClassificationEvaluator
 

 Key: SPARK-9911
 URL: https://issues.apache.org/jira/browse/SPARK-9911
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar

 SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
 not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
 Guides}}) to document this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9911) User guide for MulticlassClassificationEvaluator

2015-08-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698841#comment-14698841
 ] 

Manoj Kumar commented on SPARK-9911:


Can I work on this?

 User guide for MulticlassClassificationEvaluator
 

 Key: SPARK-9911
 URL: https://issues.apache.org/jira/browse/SPARK-9911
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-7690 adds MulticlassClassificationEvaluator to ML Pipelines, which is 
 not present in MLlib. We need to update the user guide ({{ml-guide#Algorithm 
 Guides}}) to document this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695156#comment-14695156
 ] 

Manoj Kumar commented on SPARK-9906:


Sure !

 User guide for LogisticRegressionSummary
 

 Key: SPARK-9906
 URL: https://issues.apache.org/jira/browse/SPARK-9906
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
 statistics to ML pipeline logistic regression models. This feature is not 
 present in mllib and should be documented within {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6364) hashCode and equals for Matrices

2015-08-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695468#comment-14695468
 ] 

Manoj Kumar commented on SPARK-6364:


It is all right, there is enough work for everybody :P

 hashCode and equals for Matrices
 

 Key: SPARK-6364
 URL: https://issues.apache.org/jira/browse/SPARK-6364
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 hashCode implementation should be similar to Vector's. But we may want to 
 reduce the complexity by scanning only a few nonzeros instead of all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9919) Matrices should respect Java's equals and hashCode contract

2015-08-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695491#comment-14695491
 ] 

Manoj Kumar commented on SPARK-9919:


OK, but I need to make some changes. I'll resubmit the PR in some time.

 Matrices should respect Java's equals and hashCode contract
 ---

 Key: SPARK-9919
 URL: https://issues.apache.org/jira/browse/SPARK-9919
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 The contract for Java's Object is that a.equals(b) implies a.hashCode == 
 b.hashCode. So usually we need to implement both. The problem with hashCode 
 is that we shouldn't compute it based on all values, which could be very 
 expensive. You can use the implementation of Vector.hashCode as a template, 
 but that requires some changes to avoid hash code collisions.
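 A sketch of the idea in Python (the real change targets the Scala Matrix 
 classes; the cap of 16 nonzeros and the mixing scheme are assumptions, not 
 the final implementation):
 {code}
 def matrix_hash(num_rows, num_cols, entries, max_nnz=16):
     """Hash from the shape plus a bounded number of nonzeros.

     `entries` must yield (i, j, value) triples in a deterministic
     order, so equal matrices always produce equal hashes while the
     scan stays O(max_nnz).
     """
     result = 31 * (31 * 7 + num_rows) + num_cols
     seen = 0
     for i, j, v in entries:
         if v != 0.0:
             # Mix in the position too, so permuted values hash differently.
             result = 31 * result + hash((i, j, v))
             seen += 1
             if seen >= max_nnz:
                 break
     return result
 {code}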



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9919) Matrices should respect Java's equals and hashCode contract

2015-08-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695492#comment-14695492
 ] 

Manoj Kumar commented on SPARK-9919:


OK, but I need to make some changes. I'll resubmit the PR in some time.

 Matrices should respect Java's equals and hashCode contract
 ---

 Key: SPARK-9919
 URL: https://issues.apache.org/jira/browse/SPARK-9919
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 The contract for Java's Object is that a.equals(b) implies a.hashCode == 
 b.hashCode. So usually we need to implement both. The problem with hashCode 
 is that we shouldn't compute it based on all values, which could be very 
 expensive. You can use the implementation of Vector.hashCode as a template, 
 but that requires some changes to avoid hash code collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8633) List missing model methods in Python Pipeline API

2015-08-06 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar resolved SPARK-8633.

Resolution: Fixed

 List missing model methods in Python Pipeline API
 -

 Key: SPARK-8633
 URL: https://issues.apache.org/jira/browse/SPARK-8633
 Project: Spark
  Issue Type: Task
  Components: ML, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 Most Python models under the pipeline API are implemented as JavaModel 
 wrappers. However, we didn't provide methods to extract information from 
 model. In SPARK-7647, we added weights and intercept to linear models. This 
 JIRA is to list all missing model methods, create JIRAs for each, and link 
 them here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8633) List missing model methods in Python Pipeline API

2015-08-06 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660372#comment-14660372
 ] 

Manoj Kumar commented on SPARK-8633:


Should I mark this as resolved?

 List missing model methods in Python Pipeline API
 -

 Key: SPARK-8633
 URL: https://issues.apache.org/jira/browse/SPARK-8633
 Project: Spark
  Issue Type: Task
  Components: ML, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 Most Python models under the pipeline API are implemented as JavaModel 
 wrappers. However, we didn't provide methods to extract information from 
 model. In SPARK-7647, we added weights and intercept to linear models. This 
 JIRA is to list all missing model methods, create JIRAs for each, and link 
 them here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-08-05 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658600#comment-14658600
 ] 

Manoj Kumar commented on SPARK-6488:


I'll create a JIRA in a while. I am just adding support for the multiply 
method in RowMatrix, since it goes with the PCA example (along with PCA and 
SVD). The others can be done by you :) Thanks!

 Support addition/multiplication in PySpark's BlockMatrix
 

 Key: SPARK-6488
 URL: https://issues.apache.org/jira/browse/SPARK-6488
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
 should reuse the Scala implementation instead of having a separate 
 implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9655) Add missing methods to linalg.distributed

2015-08-05 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9655:
--

 Summary: Add missing methods to linalg.distributed
 Key: SPARK-9655
 URL: https://issues.apache.org/jira/browse/SPARK-9655
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Manoj Kumar


Missing methods in linalg.distributed

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9655) Add missing methods to linalg.distributed

2015-08-05 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar closed SPARK-9655.
--
Resolution: Duplicate

 Add missing methods to linalg.distributed
 -

 Key: SPARK-9655
 URL: https://issues.apache.org/jira/browse/SPARK-9655
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Manoj Kumar

 Missing methods in linalg.distributed
 RowMatrix
 1. computeGramianMatrix
 2. computeCovariance
 3. computeColumnSummaryStatistics
 4. columnSimilarities
 5. tallSkinnyQR
 IndexedRowMatrix
 1. computeGramianMatrix()
 CoordinateMatrix
 1. transpose()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9656) Add missing methods to linalg.distributed

2015-08-05 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658937#comment-14658937
 ] 

Manoj Kumar commented on SPARK-9656:


cc: [~mwdus...@us.ibm.com]

 Add missing methods to linalg.distributed
 -

 Key: SPARK-9656
 URL: https://issues.apache.org/jira/browse/SPARK-9656
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Manoj Kumar

 Missing methods in linalg.distributed
 RowMatrix
 1. computeGramianMatrix
 2. computeCovariance
 3. computeColumnSummaryStatistics
 4. columnSimilarities
 5. tallSkinnyQR
 IndexedRowMatrix
 1. computeGramianMatrix()
 CoordinateMatrix
 1. transpose()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9656) Add missing methods to linalg.distributed

2015-08-05 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9656:
--

 Summary: Add missing methods to linalg.distributed
 Key: SPARK-9656
 URL: https://issues.apache.org/jira/browse/SPARK-9656
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Manoj Kumar


Missing methods in linalg.distributed

RowMatrix
1. computeGramianMatrix
2. computeCovariance
3. computeColumnSummaryStatistics
4. columnSimilarities
5. tallSkinnyQR

IndexedRowMatrix
1. computeGramianMatrix()

CoordinateMatrix
1. transpose()
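Most of these can be thin wrappers over the Scala implementations. A minimal 
sketch for one of them, assuming the JavaModelWrapper plumbing in 
pyspark.mllib.common (the exact wiring of the backing Java object is elided):
{code}
from pyspark.mllib.common import JavaModelWrapper

class RowMatrix(JavaModelWrapper):
    def computeGramianMatrix(self):
        """Compute A^T A by delegating to the Scala RowMatrix."""
        return self.call("computeGramianMatrix")
{code}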




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9533) Add missing methods in Word2Vec ML (Python API)

2015-08-04 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-9533:
---
Component/s: PySpark

 Add missing methods in Word2Vec ML (Python API)
 ---

 Key: SPARK-9533
 URL: https://issues.apache.org/jira/browse/SPARK-9533
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Manoj Kumar
Priority: Minor

 After SPARK-8874 is resolved, we can add Python wrappers for the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9484) Word2Vec import/export for original binary format

2015-08-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651899#comment-14651899
 ] 

Manoj Kumar commented on SPARK-9484:


I just went through the C code that does the .bin reading.

What would be the best way to go about this? The code paths should be almost 
completely different depending on whether path.endsWith(".bin") or not, right? 
Also, should this use the SaveLoadV1_0 object, or should we have a different 
object (say SaveLoadBinary), which would keep the code paths independent and 
make maintenance easier?

 Word2Vec import/export for original binary format
 -

 Key: SPARK-9484
 URL: https://issues.apache.org/jira/browse/SPARK-9484
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 It would be nice to add model import/export for Word2Vec which handles the 
 original binary format used by [https://code.google.com/p/word2vec/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8874) Add missing methods in Word2Vec ML

2015-08-02 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650619#comment-14650619
 ] 

Manoj Kumar commented on SPARK-8874:


Done. Thanks.

 Add missing methods in Word2Vec ML
 --

 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar

 Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9533) Add missing methods in Word2Vec ML (Python API)

2015-08-02 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9533:
--

 Summary: Add missing methods in Word2Vec ML (Python API)
 Key: SPARK-9533
 URL: https://issues.apache.org/jira/browse/SPARK-9533
 Project: Spark
  Issue Type: Improvement
Reporter: Manoj Kumar


After SPARK-8874 is resolved, we can add Python wrappers for the same.
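A minimal sketch of what the wrappers could look like, assuming the 
`_call_java` helper available to JavaModel subclasses in pyspark.ml.wrapper 
(the helper name is an assumption here):
{code}
from pyspark.ml.wrapper import JavaModel

class Word2VecModel(JavaModel):
    def getVectors(self):
        """Return the word -> vector mapping as a DataFrame."""
        return self._call_java("getVectors")

    def findSynonyms(self, word, num):
        """Return up to `num` (word, similarity) rows closest to `word`."""
        return self._call_java("findSynonyms", word, num)
{code}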



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-08-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650314#comment-14650314
 ] 

Manoj Kumar commented on SPARK-6227:


[~mengxr] Can this be assigned to me, since the BlockMatrix PR is already 
being worked on?

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
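 A sketch of how the Python API could mirror the Scala side (method names 
 follow Scala's RowMatrix; this is a proposal, and `sc` is an assumed running 
 SparkContext):
 {code}
 from pyspark.mllib.linalg.distributed import RowMatrix

 rows = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
 mat = RowMatrix(rows)

 # Top-2 singular values/vectors; computeU=True also materializes U.
 svd = mat.computeSVD(2, computeU=True)
 u, s, v = svd.U, svd.s, svd.V

 # Project the rows onto the top-2 principal components.
 pcs = mat.computePrincipalComponents(2)
 projected = mat.multiply(pcs)
 {code}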



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9525) Optimize SparseVector initializations in linalg

2015-08-01 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-9525:
---
Priority: Major  (was: Minor)

 Optimize SparseVector initializations in linalg
 ---

 Key: SPARK-9525
 URL: https://issues.apache.org/jira/browse/SPARK-9525
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar

 1. Remove the sorting of indices and assume that the user gives a sorted 
 tuple of indices, values, etc.
 2. Avoid iterating twice to get the indices and values if the argument 
 provided is a dict.
 3. Add checks such that the length of the indices array is less than the 
 size provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9525) Optimize SparseVector initializations in linalg

2015-08-01 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9525:
--

 Summary: Optimize SparseVector initializations in linalg
 Key: SPARK-9525
 URL: https://issues.apache.org/jira/browse/SPARK-9525
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor


1. Remove the sorting of indices and assume that the user gives a sorted 
tuple of indices, values, etc.

2. Avoid iterating twice to get the indices and values if the argument provided 
is a dict.

3. Add checks such that the length of the indices array is less than the size 
provided.
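A minimal sketch of the constructor path these three points describe 
(simplified; the real constructor also accepts pairs of lists and array 
inputs):
{code}
import numpy as np

def make_sparse(size, data):
    if isinstance(data, dict):
        # One pass over the sorted items instead of separate passes
        # for keys and values.
        pairs = sorted(data.items())
        indices = np.array([p[0] for p in pairs], dtype=np.int32)
        values = np.array([p[1] for p in pairs], dtype=np.float64)
    else:
        # Assume the caller already provides sorted indices.
        indices, values = data
        indices = np.asarray(indices, dtype=np.int32)
        values = np.asarray(values, dtype=np.float64)
    if len(indices) > 0 and indices[-1] >= size:
        raise ValueError("index %d out of range for size %d"
                         % (indices[-1], size))
    return indices, values
{code}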



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647380#comment-14647380
 ] 

Manoj Kumar commented on SPARK-9277:


I will not have access to a development environment till Saturday. Feel free to 
fix it. Thanks.

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead 
 to a Java error at runtime, for example when training 
 LogisticRegressionWithSGD.
 Here is the test case (run against Spark 1.3.1):
 {code}
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD

 x = SparseVector(2, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
 l = LabeledPoint(0, x)
 r = sc.parallelize([l])
 m = LogisticRegressionWithSGD.train(r)
 {code}
 Error:
 {code}
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 {code}
 Attached is the notebook with the scenario and the full message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg

2015-07-28 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9408:
--

 Summary: Refactor mllib/linalg.py to mllib/linalg
 Key: SPARK-9408
 URL: https://issues.apache.org/jira/browse/SPARK-9408
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Manoj Kumar


We need to refactor mllib/linalg.py into an mllib/linalg package so that the 
project structure mirrors that of Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-24 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14640055#comment-14640055
 ] 

Manoj Kumar commented on SPARK-9277:


I have labelled this as started. Will fix this in 4 days, if no one comes 
forward by then.

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead 
 to a Java error at runtime, for example when training 
 LogisticRegressionWithSGD.
 Here is the test case (run against Spark 1.3.1):
 {code}
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD

 x = SparseVector(2, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
 l = LabeledPoint(0, x)
 r = sc.parallelize([l])
 m = LogisticRegressionWithSGD.train(r)
 {code}
 Error:
 {code}
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 {code}
 Attached is the notebook with the scenario and the full message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-24 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-9277:
---
Labels: starter  (was: )

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead 
 to a Java error at runtime, for example when training 
 LogisticRegressionWithSGD.
 Here is the test case (run against Spark 1.3.1):
 {code}
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD

 x = SparseVector(2, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
 l = LabeledPoint(0, x)
 r = sc.parallelize([l])
 m = LogisticRegressionWithSGD.train(r)
 {code}
 Error:
 {code}
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 {code}
 Attached is the notebook with the scenario and the full message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7105) Support model save/load in Python's GaussianMixture

2015-07-21 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635424#comment-14635424
 ] 

Manoj Kumar commented on SPARK-7105:


Hi, are you still working on this?

 Support model save/load in Python's GaussianMixture
 ---

 Key: SPARK-7105
 URL: https://issues.apache.org/jira/browse/SPARK-7105
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yu Ishikawa
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9223) Support model save/load in Python's LDA

2015-07-21 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9223:
--

 Summary: Support model save/load in Python's LDA
 Key: SPARK-9223
 URL: https://issues.apache.org/jira/browse/SPARK-9223
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9222) Make class instantiation variables in DistributedLDAModel private[clustering]

2015-07-21 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9222:
--

 Summary: Make class instantiation variables in DistributedLDAModel 
private[clustering]
 Key: SPARK-9222
 URL: https://issues.apache.org/jira/browse/SPARK-9222
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Manoj Kumar
Priority: Minor


This would enable testing the various class variables like docConcentration, 
topicConcentration, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6486) Add BlockMatrix in PySpark

2015-07-17 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631182#comment-14631182
 ] 

Manoj Kumar commented on SPARK-6486:


Great, I will start on this after the weekend.

 Add BlockMatrix in PySpark
 --

 Key: SPARK-6486
 URL: https://issues.apache.org/jira/browse/SPARK-6486
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
 MatrixUDT for serialization. This JIRA should contain conversions between 
 IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover 
 linear algebra operations of block matrices.
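 A sketch of the conversions this JIRA covers, assuming the eventual 
 pyspark.mllib.linalg.distributed API (`sc` is an assumed running 
 SparkContext):
 {code}
 from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

 entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 2, 3.4)])
 coord = CoordinateMatrix(entries)

 # Conversion to a BlockMatrix (no block-level linear algebra yet).
 blocks = coord.toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)
 indexed = blocks.toIndexedRowMatrix()
 {code}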



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary

2015-07-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630053#comment-14630053
 ] 

Manoj Kumar commented on SPARK-9112:


Yes, that is the idea. Also, we need not port it to ML right now; we could 
convert the transformed DataFrame to the required input type in mllib.

Also it might be useful to return the probability for the predicted class (as 
done by predict_proba in scikit-learn). How does that sound?

 Implement LogisticRegressionSummary similar to LinearRegressionSummary
 --

 Key: SPARK-9112
 URL: https://issues.apache.org/jira/browse/SPARK-9112
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Manoj Kumar
Priority: Minor

 Since the API for LinearRegressionSummary has been merged, other models 
 should follow suit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters

2015-07-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629808#comment-14629808
 ] 

Manoj Kumar commented on SPARK-6001:


I just started to work on this.

 K-Means clusterer should return the assignments of input points to clusters
 ---

 Key: SPARK-6001
 URL: https://issues.apache.org/jira/browse/SPARK-6001
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: Derrick Burns
Priority: Minor

 The K-Means clusterer returns a KMeansModel that contains the cluster 
 centers. However, when available, I suggest that the K-Means clusterer also 
 return an RDD of the assignments of the input data to the clusters. While the 
 assignments can be computed given the KMeansModel, why not return assignments 
 if they are available to save re-computation costs.
 The K-means implementation at 
 https://github.com/derrickburns/generalized-kmeans-clustering returns the 
 assignments when available.
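 Until then, a sketch of how the assignments can be recomputed today in 
 PySpark; this is exactly the extra pass the request wants to avoid:
 {code}
 from pyspark.mllib.clustering import KMeans

 # `points` is assumed to be an RDD of feature vectors.
 model = KMeans.train(points, 3)
 assignments = points.map(lambda p: (p, model.predict(p)))
 {code}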



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary

2015-07-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630014#comment-14630014
 ] 

Manoj Kumar commented on SPARK-9112:


Indeed, it seems so. But the merged LinearRegressionSummary also has just 
RegressionMetrics (unless there is anything that I missed).
By the way, I'm not sure how to set the priority labels, sorry if it is wrong 
(the umbrella JIRA has a critical priority, so I thought that this might 
qualify as major).

 Implement LogisticRegressionSummary similar to LinearRegressionSummary
 --

 Key: SPARK-9112
 URL: https://issues.apache.org/jira/browse/SPARK-9112
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Manoj Kumar
Priority: Minor

 Since the API for LinearRegressionSummary has been merged, other models 
 should follow suit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters

2015-07-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629981#comment-14629981
 ] 

Manoj Kumar commented on SPARK-6001:


Oops. I just figured out we do not have a KMeans yet in spark.ml

 K-Means clusterer should return the assignments of input points to clusters
 ---

 Key: SPARK-6001
 URL: https://issues.apache.org/jira/browse/SPARK-6001
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: Derrick Burns
Priority: Minor

 The K-Means clusterer returns a KMeansModel that contains the cluster 
 centers. However, when available, I suggest that the K-Means clusterer also 
 return an RDD of the assignments of the input data to the clusters. While the 
 assignments can be computed given the KMeansModel, why not return assignments 
 if they are available to save re-computation costs.
 The K-means implementation at 
 https://github.com/derrickburns/generalized-kmeans-clustering returns the 
 assignments when available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary

2015-07-16 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9112:
--

 Summary: Implement LogisticRegressionSummary similar to 
LinearRegressionSummary
 Key: SPARK-9112
 URL: https://issues.apache.org/jira/browse/SPARK-9112
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Manoj Kumar


Since the API for LinearRegressionSummary has been merged, other models should 
follow suit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary

2015-07-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630132#comment-14630132
 ] 

Manoj Kumar commented on SPARK-9112:


I see that these are fields in the transformed DataFrame already (the raw 
predictions, probability, etc.). I think just wrapping up the metrics should 
suffice. I'll send a PR in a while.
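A minimal sketch of that wrapping in PySpark terms, computed from the 
transformed DataFrame (column names are the usual defaults; the actual 
deliverable is a Scala summary class):
{code}
from pyspark.mllib.evaluation import MulticlassMetrics

# `model` is a fitted model and `df` the training DataFrame (assumptions).
predicted = model.transform(df)
prediction_and_label = predicted.select("prediction", "label") \
    .rdd.map(lambda row: (row.prediction, row.label))
metrics = MulticlassMetrics(prediction_and_label)
print(metrics.precision(), metrics.recall(), metrics.fMeasure())
{code}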

 Implement LogisticRegressionSummary similar to LinearRegressionSummary
 --

 Key: SPARK-9112
 URL: https://issues.apache.org/jira/browse/SPARK-9112
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Manoj Kumar
Priority: Minor

 Since the API for LinearRegressionSummary has been merged, other models 
 should follow suit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8996) Add Python API for Kolmogorov-Smirnov Test

2015-07-14 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626409#comment-14626409
 ] 

Manoj Kumar commented on SPARK-8996:


Hi, Can I work on this?

 Add Python API for Kolmogorov-Smirnov Test
 --

 Key: SPARK-8996
 URL: https://issues.apache.org/jira/browse/SPARK-8996
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 Add Python API for the Kolmogorov-Smirnov test implemented in SPARK-8598. It 
 should be similar to ChiSqTest in Python.
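 A sketch of the eventual Python call, mirroring where ChiSqTest lives under 
 pyspark.mllib.stat.Statistics (the exact signature mirrors the Scala one and 
 is an assumption until the port lands; `sc` is a running SparkContext):
 {code}
 from pyspark.mllib.stat import Statistics

 data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
 # Two-sided KS test of `data` against a standard normal.
 result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
 print(result.statistic, result.pValue, result.nullHypothesis)
 {code}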



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3703) Ensemble learning methods

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625859#comment-14625859
 ] 

Manoj Kumar commented on SPARK-3703:


Hi, I am interested in working on ensemble methods in general (as seen from my 
initial few pull requests). Are any of these targeted towards the 1.5 release? 
I'm asking because I might not be able to commit enough time after September.

 Ensemble learning methods
 -

 Key: SPARK-3703
 URL: https://issues.apache.org/jira/browse/SPARK-3703
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 This is a general JIRA for coordinating on adding ensemble learning methods 
 to MLlib.  These methods include a variety of boosting and bagging 
 algorithms.  Below is a general design doc for ensemble methods (currently 
 focused on boosting).  Please comment here about general discussion and 
 coordination; for comments about specific algorithms, please comment on their 
 respective JIRAs.
 [Design doc for ensemble methods | 
 https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625289#comment-14625289
 ] 

Manoj Kumar edited comment on SPARK-7126 at 7/13/15 8:36 PM:
-

[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. (Internally 
we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 
1, ..., n_labels - 1]; the numerically smallest label gets zero.) This is in 
contrast to StringIndexer, which gives the most frequent label the smallest 
index.

2. I'm not sure it is necessary to show the users what is being done 
internally. Should it not be sufficient to just give them the predicted output 
in terms of the input labels? (I'm highly biased based on my previous 
experience in sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers.)



was (Author: mechcoder):
[~josephkb]

1. In scikit-learn predict outputs the same labels as the inputs. (Internally 
we use sklearn.preprocessing.LabelEncoder) to encode the input labels into [0, 
1, .. n_labels - 1] in contrast to StringIndexer which gives the most frequent 
label the smallest.

2. I'm not sure it is necessary to show the users, what is being done 
internally. Should it not be sufficient to just give them the predicted output 
in terms of the input labels (I'm highly biased based on my previous experience 
in sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers)


 For spark.ml Classifiers, automatically index labels if they are not yet 
 indexed
 

 Key: SPARK-7126
 URL: https://issues.apache.org/jira/browse/SPARK-7126
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Now that we have StringIndexer, we could have 
 spark.ml.classification.Classifier (the abstraction) automatically handle 
 label indexing if the labels are not yet indexed.
 This would require a bit of design:
 * Should predict() output the original labels or the indices?
 * How should we notify users that the labels are being automatically indexed?
 * How should we provide that index to the users?
 * If multiple parts of a Pipeline automatically index labels, what do we need 
 to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625241#comment-14625241
 ] 

Manoj Kumar commented on SPARK-6261:


We can mark this as resolved, I think?

 Python MLlib API missing items: Feature
 ---

 Key: SPARK-6261
 URL: https://issues.apache.org/jira/browse/SPARK-6261
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 StandardScalerModel
 * All functionality except predict() is missing.
 IDFModel
 * idf
 Word2Vec
 * setMinCount
 Word2VecModel
 * getVectors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625289#comment-14625289
 ] 

Manoj Kumar commented on SPARK-7126:


[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. (Internally 
we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 
1, ..., n_labels - 1].) This is in contrast to StringIndexer, which gives the 
most frequent label the smallest index.

2. I'm not sure it is necessary to show the users what is being done 
internally. Should it not be sufficient to just give them the predicted output 
in terms of the input labels? (I'm highly biased based on my previous 
experience in sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers.)
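To make the ordering difference concrete, a small sketch (the 
StringIndexer-style mapping is paraphrased in plain Python here):
{code}
from collections import Counter
from sklearn.preprocessing import LabelEncoder

labels = ["b", "a", "b", "c", "b", "a"]

# scikit-learn: the lexically smallest label gets index 0.
enc = LabelEncoder().fit(labels)
print(dict(zip(enc.classes_, range(len(enc.classes_)))))  # {'a': 0, 'b': 1, 'c': 2}

# StringIndexer-style: the most frequent label gets index 0.
by_freq = [lab for lab, _ in Counter(labels).most_common()]
print({lab: i for i, lab in enumerate(by_freq)})          # {'b': 0, 'a': 1, 'c': 2}
{code}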


 For spark.ml Classifiers, automatically index labels if they are not yet 
 indexed
 

 Key: SPARK-7126
 URL: https://issues.apache.org/jira/browse/SPARK-7126
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Now that we have StringIndexer, we could have 
 spark.ml.classification.Classifier (the abstraction) automatically handle 
 label indexing if the labels are not yet indexed.
 This would require a bit of design:
 * Should predict() output the original labels or the indices?
 * How should we notify users that the labels are being automatically indexed?
 * How should we provide that index to the users?
 * If multiple parts of a Pipeline automatically index labels, what do we need 
 to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8704:
---
Summary: Add missing methods in StandardScaler (ML and PySpark)  (was: Add 
missing methods in StandardScaler)

 Add missing methods in StandardScaler (ML and PySpark)
 --

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std and mean to StandardScalerModel
 getVectors and findSynonyms to Word2VecModel
 setFeatures and getFeatures to HashingTF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8704:
---
Summary: Add missing methods in StandardScaler  (was: Add additional 
methods to wrappers in ml.pyspark.feature)

 Add missing methods in StandardScaler
 -

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std and mean to StandardScalerModel
 getVectors and findSynonyms to Word2VecModel
 setFeatures and getFeatures to HashingTF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8874:
---
Component/s: PySpark
 ML

 Add missing methods in Word2Vec ML
 --

 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8874) Add missing methods in Word2Vec ML

2015-07-07 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8874:
--

 Summary: Add missing methods in Word2Vec ML
 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
Reporter: Manoj Kumar


Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8823) Optimizations for sparse vector products in pyspark.mllib.linalg

2015-07-04 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8823:
--

 Summary: Optimizations for sparse vector products in 
pyspark.mllib.linalg
 Key: SPARK-8823
 URL: https://issues.apache.org/jira/browse/SPARK-8823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar


Currently we iterate in pure Python over the indices and values of both 
sparse vectors; these loops can be vectorized in NumPy.
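A minimal sketch of the vectorized dot product, assuming both index arrays are 
sorted and unique (which SparseVector guarantees):
{code}
import numpy as np

def sparse_dot(inds_a, vals_a, inds_b, vals_b):
    # Keep only entries whose indices appear in both vectors, then
    # multiply the aligned values in a single vectorized step.
    return np.dot(vals_a[np.in1d(inds_a, inds_b)],
                  vals_b[np.in1d(inds_b, inds_a)])

a_i, a_v = np.array([0, 3, 7]), np.array([1.0, 2.0, 3.0])
b_i, b_v = np.array([3, 5, 7]), np.array([4.0, 5.0, 6.0])
print(sparse_dot(a_i, a_v, b_i, b_v))  # 2*4 + 3*6 = 26.0
{code}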



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark

2015-07-02 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611611#comment-14611611
 ] 

Manoj Kumar commented on SPARK-8706:


Sorry for sounding dumb, but the present code downloads pep8 as a single 
script. However, it seems that pylint is a full repo, which in turn has two 
dependencies of its own. What is the preferred way to handle this in Spark?

 Implement Pylint / Prospector checks for PySpark
 

 Key: SPARK-8706
 URL: https://issues.apache.org/jira/browse/SPARK-8706
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra, PySpark
Reporter: Josh Rosen

 It would be nice to implement Pylint / Prospector 
 (https://github.com/landscapeio/prospector) checks for PySpark. As with the 
 style checker rules, I'll imagine that we'll want to roll out new rules 
 gradually in order to avoid a mass refactoring commit.
 For starters, we should create a pull request that introduces the harness for 
 running the linters, add a configuration file which enables only the lint 
 checks that currently pass, and install the required dependencies on Jenkins. 
 Once we've done this, we can open a series of smaller followup PRs to 
 gradually enable more linting checks and to fix existing violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors

2015-07-02 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-7401:
---
Priority: Major  (was: Minor)

 Dot product and squared_distances should be vectorized in Vectors
 -

 Key: SPARK-7401
 URL: https://issues.apache.org/jira/browse/SPARK-7401
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8291) Add parse functionality to LabeledPoint in PySpark

2015-07-01 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar closed SPARK-8291.
--
Resolution: Won't Fix

 Add parse functionality to LabeledPoint in PySpark
 --

 Key: SPARK-8291
 URL: https://issues.apache.org/jira/browse/SPARK-8291
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor

 It is useful to have functionality that can parse a string into a 
 LabeledPoint while loading files, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8265) Add LinearDataGenerator to pyspark.mllib.utils

2015-07-01 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar resolved SPARK-8265.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Add LinearDataGenerator to pyspark.mllib.utils
 --

 Key: SPARK-8265
 URL: https://issues.apache.org/jira/browse/SPARK-8265
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor
 Fix For: 1.5.0


 This is useful in testing various linear models in pyspark
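
A hedged usage sketch of the added generator; the keyword names below are 
assumed to mirror the Scala LinearDataGenerator that the Python wrapper 
delegates to:
{code}
from pyspark.mllib.util import LinearDataGenerator

# Generate 100 labeled points around y = 0.1*x1 + 0.2*x2 with noise eps.
# Parameter names are assumptions based on the Scala side.
points = LinearDataGenerator.generateLinearInput(
    intercept=0.0, weights=[0.1, 0.2], xMean=[0.0, 0.0],
    xVariance=[1.0, 1.0], nPoints=100, seed=42, eps=0.1)
{code}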



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3258) Python API for streaming MLlib algorithms

2015-06-30 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608757#comment-14608757
 ] 

Manoj Kumar commented on SPARK-3258:


[~mengxr] We can mark this as resolved.

 Python API for streaming MLlib algorithms
 -

 Key: SPARK-3258
 URL: https://issues.apache.org/jira/browse/SPARK-3258
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark, Streaming
Reporter: Xiangrui Meng

 This is an umbrella JIRA to track Python port of streaming MLlib algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3258) Python API for streaming MLlib algorithms

2015-06-30 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar resolved SPARK-3258.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Python API for streaming MLlib algorithms
 -

 Key: SPARK-3258
 URL: https://issues.apache.org/jira/browse/SPARK-3258
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark, Streaming
Reporter: Xiangrui Meng
 Fix For: 1.5.0


 This is an umbrella JIRA to track Python port of streaming MLlib algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature

2015-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8704:
--

 Summary: Add additional methods to wrappers in ml.pyspark.feature
 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar


Add std and mean to StandardScalerModel,
getVectors and findSynonyms to Word2VecModel,
and setFeatures and getFeatures to HashingTF (see the sketch below).
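
A hedged sketch of the usual pyspark.ml wrapper pattern these would follow: 
delegate to the Java-side model through `_call_java`. The method names come 
from the list above and are proposals, not a final API:
{code}
from pyspark.ml.wrapper import JavaModel

class StandardScalerModel(JavaModel):
    @property
    def std(self):
        # Forward to the wrapped Java model's std()
        return self._call_java("std")

    @property
    def mean(self):
        # Forward to the wrapped Java model's mean()
        return self._call_java("mean")
{code}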



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark

2015-06-29 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606311#comment-14606311
 ] 

Manoj Kumar commented on SPARK-8706:


Mind if I hack on this?

 Implement Pylint / Prospector checks for PySpark
 

 Key: SPARK-8706
 URL: https://issues.apache.org/jira/browse/SPARK-8706
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra, PySpark
Reporter: Josh Rosen

 It would be nice to implement Pylint / Prospector 
 (https://github.com/landscapeio/prospector) checks for PySpark. As with the 
 style checker rules, I'd imagine that we'll want to roll out new rules 
 gradually in order to avoid a mass refactoring commit.
 For starters, we should create a pull request that introduces the harness for 
 running the linters, add a configuration file which enables only the lint 
 checks that currently pass, and install the required dependencies on Jenkins. 
 Once we've done this, we can open a series of smaller followup PRs to 
 gradually enable more linting checks and to fix existing violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8633) List missing model methods in Python Pipeline API

2015-06-29 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606266#comment-14606266
 ] 

Manoj Kumar commented on SPARK-8633:


I think that should be it.

 List missing model methods in Python Pipeline API
 -

 Key: SPARK-8633
 URL: https://issues.apache.org/jira/browse/SPARK-8633
 Project: Spark
  Issue Type: Task
  Components: ML, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 Most Python models under the pipeline API are implemented as JavaModel 
 wrappers. However, we didn't provide methods to extract information from 
 model. In SPARK-7647, we added weights and intercept to linear models. This 
 JIRA is to list all missing model methods, create JIRAs for each, and link 
 them here.
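
For reference, what the SPARK-7647 additions look like from Python, as a 
hedged example that assumes `model` is an already-fitted pyspark.ml linear 
model such as LogisticRegressionModel:
{code}
# `model` is assumed to be a fitted pyspark.ml linear model.
print(model.weights)    # coefficient vector, read back from the Java model
print(model.intercept)  # intercept term, likewise
{code}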



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8711) Add additional methods to JavaModel wrappers in trees

2015-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8711:
--

 Summary: Add additional methods to JavaModel wrappers in trees
 Key: SPARK-8711
 URL: https://issues.apache.org/jira/browse/SPARK-8711
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8678) Default values in Pipeline API should be immutable

2015-06-27 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8678:
--

 Summary: Default values in Pipeline API should be immutable
 Key: SPARK-8678
 URL: https://issues.apache.org/jira/browse/SPARK-8678
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Manoj Kumar


If the default params are mutable and get modified, then any later call to the 
function or method that omits those params silently sees the modified values 
rather than the intended defaults.
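
The underlying pitfall is the standard Python mutable-default-argument trap, 
illustrated here outside any actual Spark code:
{code}
class Stage(object):
    def setParams(self, stages=[]):   # the default list is created only once
        self.stages = stages
        return self

s1 = Stage().setParams()
s1.stages.append("tokenizer")         # mutates the shared default list
s2 = Stage().setParams()
print(s2.stages)                      # ['tokenizer'] rather than []

# The usual fix: default to None and build a fresh list inside the method.
{code}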



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth

2015-06-24 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753
 ] 

Manoj Kumar edited comment on SPARK-6724 at 6/24/15 5:13 PM:
-

[~hrishikesh91] Are you actively working on this? Let us know if you need help.


was (Author: mechcoder):
[~hrishikesh] Are you actively working on this? Let us know if you need help.

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6791) Model export/import for spark.ml: meta-algorithms

2015-06-24 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599777#comment-14599777
 ] 

Manoj Kumar commented on SPARK-6791:


Oh sorry, I read "block on" as "block". Do you have any good JIRAs in mind to 
start with for the Pipeline API (both the Scala and the Python APIs)? There 
are a number of issues, but I'm not sure which one to start with.

 Model export/import for spark.ml: meta-algorithms
 -

 Key: SPARK-6791
 URL: https://issues.apache.org/jira/browse/SPARK-6791
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Algorithms: Pipeline, CrossValidator (and associated models)
 This task will block on all other subtasks for [SPARK-6725].  This task will 
 also include adding export/import as a required part of the PipelineStage 
 interface since meta-algorithms will depend on sub-algorithms supporting 
 save/load.
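
A hedged sketch of the save/load surface this implies for meta-algorithms; 
the names mirror what pyspark.ml eventually shipped and are illustrative 
here, not the design decided in this thread:
{code}
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer

pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])
pipeline.save("path/to/pipeline")           # persists each stage recursively
loaded = Pipeline.load("path/to/pipeline")  # only works if every stage
                                            # supports save/load itself
{code}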



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth

2015-06-24 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753
 ] 

Manoj Kumar edited comment on SPARK-6724 at 6/24/15 5:12 PM:
-

[~hrishikesh] Are you actively working on this? Let us know if you need help.


was (Author: mechcoder):
[~hrishikesh] Are you actively working on this?

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-06-24 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599753#comment-14599753
 ] 

Manoj Kumar commented on SPARK-6724:


[~hrishikesh] Are you actively working on this?

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5694) Python API for evaluation metrics

2015-06-23 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar resolved SPARK-5694.

Resolution: Fixed

 Python API for evaluation metrics
 -

 Key: SPARK-5694
 URL: https://issues.apache.org/jira/browse/SPARK-5694
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 This is an umbrella JIRA for evaluation metrics in Python. They should be 
 defined under `pyspark.mllib.evaluation`. We should try wrapping Scala's 
 implementation instead of implementing them in Python.
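
A hedged sketch of the "wrap Scala" approach from the user's side: the Python 
class just hands an RDD to the JVM implementation. Assumes an active 
SparkContext `sc`:
{code}
from pyspark.mllib.evaluation import RegressionMetrics

# (prediction, observation) pairs; the heavy lifting happens in Scala.
predictionAndObservations = sc.parallelize(
    [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0)])
metrics = RegressionMetrics(predictionAndObservations)
print(metrics.rootMeanSquaredError)
{code}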



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6791) Model export/import for spark.ml: meta-algorithms

2015-06-20 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594802#comment-14594802
 ] 

Manoj Kumar commented on SPARK-6791:


[~josephkb] I would like to work on this. Which model do we need to start with?

 Model export/import for spark.ml: meta-algorithms
 -

 Key: SPARK-6791
 URL: https://issues.apache.org/jira/browse/SPARK-6791
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Algorithms: Pipeline, CrossValidator (and associated models)
 This task will block on all other subtasks for [SPARK-6725].  This task will 
 also include adding export/import as a required part of the PipelineStage 
 interface since meta-algorithms will depend on sub-algorithms supporting 
 save/load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8479) Add numNonzeros and numActives to linalg.Matrices

2015-06-19 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8479:
--

 Summary: Add numNonzeros and numActives to linalg.Matrices
 Key: SPARK-8479
 URL: https://issues.apache.org/jira/browse/SPARK-8479
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Manoj Kumar
Priority: Minor


Add numNonzeros to count the number of nonzero values, and numActives to 
report the number of values explicitly stored.
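
The distinction between the two counts, illustrated in NumPy terms (the JIRA 
itself targets Scala's linalg.Matrices; this is only an illustration):
{code}
import numpy as np

values = np.array([0.0, 3.0, 0.0, 1.5])   # entries explicitly stored
numActives = values.size                  # all explicitly stored entries: 4
numNonzeros = np.count_nonzero(values)    # entries that are nonzero: 2
{code}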



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


