[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-13 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572545#comment-15572545
 ] 

Gayathri Murali commented on SPARK-12664:
-

[~yanboliang] I am not working on this. Please feel free to take it.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 
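For context, a minimal standalone sketch of turning a raw per-class score vector into probabilities via softmax. This is illustrative only, not the Spark internals (the private mlpModel is what actually produces the raw scores):

{code}
// Illustrative only: normalize raw per-class scores into probabilities.
// Subtracting the max first keeps math.exp from overflowing.
def softmax(raw: Array[Double]): Array[Double] = {
  val max = raw.max
  val exps = raw.map(x => math.exp(x - max))
  val sum = exps.sum
  exps.map(_ / sum)
}

// softmax(Array(1.0, 2.0, 3.0)) ~= Array(0.090, 0.245, 0.665)
{code}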



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17135) Consolidate code in linear/logistic regression where possible

2016-08-19 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428401#comment-15428401
 ] 

Gayathri Murali commented on SPARK-17135:
-

I can work on this

> Consolidate code in linear/logistic regression where possible
> -
>
> Key: SPARK-17135
> URL: https://issues.apache.org/jira/browse/SPARK-17135
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> There is shared code between MultinomialLogisticRegression, 
> LogisticRegression, and LinearRegression. We should consolidate where 
> possible. Also, we should move some code out of LogisticRegression.scala into 
> a separate util file or similar.






[jira] [Commented] (SPARK-16838) Add PMML export for ML KMeans in PySpark

2016-08-02 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404591#comment-15404591
 ] 

Gayathri Murali commented on SPARK-16838:
-

I can work on this

> Add PMML export for ML KMeans in PySpark
> 
>
> Key: SPARK-16838
> URL: https://issues.apache.org/jira/browse/SPARK-16838
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>
> After we finish SPARK-11237 we should also expose PMML export in the Python 
> API for KMeans.






[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-07-06 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365533#comment-15365533
 ] 

Gayathri Murali commented on SPARK-16240:
-

I can work on this

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Priority: Minor
>
> After resolving the matrix conversion issue, the LDA model still cannot load 
> 1.6 models because one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
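The special logic could be a simple name translation applied when reading 1.6 metadata. A hypothetical sketch (the names below are placeholders, not the actual renamed LDA parameter):

{code}
// Hypothetical sketch: translate a 1.6-era param name to its 2.0
// equivalent while loading saved metadata. "oldParamName" and
// "newParamName" are placeholders only.
def translateParamName(name: String): String = name match {
  case "oldParamName" => "newParamName"
  case other => other
}
{code}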






[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343235#comment-15343235
 ] 

Gayathri Murali commented on SPARK-16000:
-

[~yuhaoyan] I can help with this. 

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.
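A sketch of the kind of conversion a backward-compatible loader might apply, assuming Spark 2.0's MLUtils helper (illustrative; the actual loader code may differ):

{code}
// Sketch: convert 1.6-style mllib.linalg.Vector columns in a loaded
// DataFrame into 2.0-style ml.linalg.Vector columns before use.
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, Dataset}

def toMlVectorColumns(oldData: Dataset[_]): DataFrame =
  MLUtils.convertVectorColumnsToML(oldData, "features")
{code}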






[jira] [Commented] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers

2016-06-17 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337335#comment-15337335
 ] 

Gayathri Murali commented on SPARK-15997:
-

https://github.com/apache/spark/pull/13745 - this is the right link to the PR.

> Audit ml.feature Update documentation for ml feature transformers
> -
>
> Key: SPARK-15997
> URL: https://issues.apache.org/jira/browse/SPARK-15997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Gayathri Murali
>Assignee: Gayathri Murali
>
> This JIRA is a subtask of SPARK-15100 and improves documentation for new 
> features added to 
> 1. HashingTF
> 2. CountVectorizer
> 3. QuantileDiscretizer
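For reference, the new parameters the documentation should cover look roughly like this in the Scala API (a sketch assuming the 2.0 setters):

{code}
// Sketch of the newly added params (Spark 2.0 Scala API):
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, QuantileDiscretizer}

val tf = new HashingTF().setBinary(true)        // non-zero counts become 1.0
val cv = new CountVectorizer().setBinary(true)  // same toggle for CountVectorizer
val qd = new QuantileDiscretizer().setRelativeError(0.01)  // approx-quantile precision
{code}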






[jira] [Created] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers

2016-06-16 Thread Gayathri Murali (JIRA)
Gayathri Murali created SPARK-15997:
---

 Summary: Audit ml.feature Update documentation for ml feature 
transformers
 Key: SPARK-15997
 URL: https://issues.apache.org/jira/browse/SPARK-15997
 Project: Spark
  Issue Type: Documentation
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Gayathri Murali


This JIRA is a subtask of SPARK-15100 and improves documentation for new 
features added to 
1. HashingTF
2. CountVectorizer
3. QuantileDiscretizer







[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-15 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332588#comment-15332588
 ] 

Gayathri Murali commented on SPARK-15930:
-

[~yuhaoyan] If you haven't already started working on this, I can send the PR. 

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.
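Per the report above, FP-Growth already counts the input while training (to derive minCount from minSupport), so the value could simply be kept on the model. A hedged sketch with hypothetical names:

{code}
// Sketch only (hypothetical names): capture the row count at training
// time so callers do not have to recount the input RDD.
import org.apache.spark.rdd.RDD

def trainWithCount[Item](data: RDD[Array[Item]], minSupport: Double): Long = {
  val numRecords = data.count()                             // computed anyway
  val minCount = math.ceil(minSupport * numRecords).toLong  // used for mining
  // ... run the usual FP-Growth mining with minCount ...
  numRecords                                                // expose on the model
}
{code}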






[jira] [Commented] (SPARK-15785) Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml

2016-06-07 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319100#comment-15319100
 ] 

Gayathri Murali commented on SPARK-15785:
-

I will work on this. Thanks!

> Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml
> --
>
> Key: SPARK-15785
> URL: https://issues.apache.org/jira/browse/SPARK-15785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this param is needed for SPARK-4591: algorithm/model parity for 
> spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Issue Comment Deleted] (SPARK-14381) Review spark.ml parity for feature transformers

2016-06-03 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-14381:

Comment: was deleted

(was: I will work on this)

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Issue Comment Deleted] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2016-06-03 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-15041:

Comment: was deleted

(was: I can work on this)

> adding mode strategy for ml.feature.Imputer for categorical features
> 
>
> Key: SPARK-15041
> URL: https://issues.apache.org/jira/browse/SPARK-15041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Adding a mode strategy to ml.feature.Imputer for categorical features. This 
> needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601
> From comments of jkbradley and Nick Pentreath in the PR
> {quote}
> Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
> approaches such as frequentItems or Count-Min Sketch (will require an update 
> to CMS to return "heavy-hitters").
> investigate if we can use metadata to only allow mode for categorical 
> features (or perhaps as an easier alternative, allow mode for only Int/Long 
> columns)
> {quote}
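A minimal sketch of what a mode strategy could compute, using a plain DataFrame aggregation (illustrative, not the Imputer implementation; the efficiency questions quoted above still apply):

{code}
// Sketch: most frequent (mode) value of one column via DataFrame ops.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

def columnMode(df: DataFrame, column: String): Any =
  df.groupBy(column).count().orderBy(desc("count")).first().get(0)
{code}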






[jira] [Issue Comment Deleted] (SPARK-15201) Handle integer overflow correctly in hash code computation

2016-06-03 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-15201:

Comment: was deleted

(was: I can work on this )

> Handle integer overflow correctly in hash code computation
> --
>
> Key: SPARK-15201
> URL: https://issues.apache.org/jira/browse/SPARK-15201
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> The mult31AndAdd() function in utils.R does not handle integer overflow well. 
> We need to find a solution that exactly matches the hash code computation 
> algorithm for strings in the JDK.
> For details, refer to the discussion in 
> https://github.com/apache/spark/pull/10436
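For reference, the JDK algorithm being matched is String.hashCode, shown here in Scala; 32-bit Int arithmetic wraps on overflow exactly as in Java, which is the behavior R's doubles lack:

{code}
// Reference semantics of java.lang.String.hashCode:
// h = 31 * h + c over the characters, with 32-bit wraparound.
def javaStringHashCode(s: String): Int =
  s.foldLeft(0)((h, c) => 31 * h + c.toInt)
{code}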






[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-05-31 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309112#comment-15309112
 ] 

Gayathri Murali commented on SPARK-14381:
-

I will work on this

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Commented] (SPARK-15672) R programming guide update

2016-05-31 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309100#comment-15309100
 ] 

Gayathri Murali commented on SPARK-15672:
-

[~shivaram] I am working on changing the R documentation to include all the 
changes that happened with ML. Here is the link to the PR: 
https://github.com/apache/spark/pull/13285



> R programming guide update
> --
>
> Key: SPARK-15672
> URL: https://issues.apache.org/jira/browse/SPARK-15672
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> Update the programming guide (i.e. the document at 
> http://spark.apache.org/docs/latest/sparkr.html) to cover the major new 
> features in Spark 2.0. This will include 
> (a) UDFs with dapply, dapplyCollect
> (b) group UDFs with gapply 
> (c) spark.lapply for running parallel R functions
> (d) others?






[jira] [Comment Edited] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293767#comment-15293767
 ] 

Gayathri Murali edited comment on SPARK-15442 at 5/20/16 5:26 PM:
--

[~mlnick] I am working on 15100. I am just about finished. If you have one 
already, please go ahead. It would take me about an hour to send a PR for this 
one.


was (Author: gayathrimurali):
[~mlnick] I am working on 15100. I am just about finished. If you have one 
already, please go ahead. It will take an hour to send one for this

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293767#comment-15293767
 ] 

Gayathri Murali commented on SPARK-15442:
-

[~mlnick] I am working on 15100. I am just about finished. If you have one 
already, please go ahead. It will take an hour to send one for this

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293731#comment-15293731
 ] 

Gayathri Murali commented on SPARK-15442:
-

I will help with this.

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-15098) Audit: ml.classification

2016-05-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290025#comment-15290025
 ] 

Gayathri Murali commented on SPARK-15098:
-

[~yanboliang] Are you working on this? 

> Audit: ml.classification
> 
>
> Key: SPARK-15098
> URL: https://issues.apache.org/jira/browse/SPARK-15098
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290007#comment-15290007
 ] 

Gayathri Murali commented on SPARK-15194:
-

[~holdenk] I see that mllib/stat/distribution.py has the Python class for the 
mllib version of MultivariateGaussian. Are you looking to create a similar 
stat/distribution.py under pyspark/ml for the mllib-local version of 
MultivariateGaussian? 

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLLib version but not the ML version. This 
> would allow Python's  `GaussianMixture` to more closely match the Scala API.






[jira] [Commented] (SPARK-15100) Audit: ml.feature

2016-05-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289513#comment-15289513
 ] 

Gayathri Murali commented on SPARK-15100:
-

While making changes to CountVectorizer, HashingTF and QuantileDiscretizer, I 
found the following issues:

1. relativeError is not available in Python for QuantileDiscretizer. 
2. The built-in Python examples in feature.py do not include the newly added 
parameters such as binary or relativeError.
3. I am making changes to the example source code to include these parameters 
in model building. I hope this is expected.

For 1 and 2, should I send out a separate PR?

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Comment Edited] (SPARK-15100) Audit: ml.feature

2016-05-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289513#comment-15289513
 ] 

Gayathri Murali edited comment on SPARK-15100 at 5/18/16 6:25 PM:
--

While making changes to CountVectorizer, HashingTF and QuantileDiscretizer, I 
found the following issues:

1. relativeError is not available in Python for QuantileDiscretizer. 
2. The built-in Python examples in feature.py do not include the newly added 
binary toggle parameter in HashingTF and CountVectorizer.
3. I am making changes to the example source code to include these parameters 
so that they show up in the user doc. I hope this is expected.

For 1 and 2, should I send out a separate PR?


was (Author: gayathrimurali):
While making changes to CountVectorizer, HashingTF and QuantileDiscretizer, I 
found the following issues:

1. relativeError is not available in Python for QuantileDiscretizer. 
2. The built-in Python examples in feature.py do not include the newly added 
parameters such as binary or relativeError.
3. I am making changes to the example source code to include these parameters 
in model building. I hope this is expected.

For 1 and 2, should I send out a separate PR?

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Commented] (SPARK-15100) Audit: ml.feature

2016-05-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289394#comment-15289394
 ] 

Gayathri Murali commented on SPARK-15100:
-

[~bryanc] I have the PR ready for CountVectorizer, HashingTF and 
QuantileDiscretizer. Do you mind if I send it? Could you please help review it?

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Commented] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-05-17 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288505#comment-15288505
 ] 

Gayathri Murali commented on SPARK-15254:
-

I can work on this

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse - we should 
> fill it out with a more concrete description.






[jira] [Commented] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R

2016-05-13 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283209#comment-15283209
 ] 

Gayathri Murali commented on SPARK-15129:
-

||API Change||User Guide updated||
|Family and Link Function to glm()|No|
|Summary statistics for glm|No|
|SparkR API consistent with R|No|
|K Means wrapper|No|
|Naïve Bayes wrapper|No|
|AFT Survival regression wrapper|No|
|Model persistence to KMeans, Naïve Bayes, AFTSurvivalRegression and glm|No|

These are the major API changes that happened in 2.0. Please point out if I 
missed anything.

> Clarify conventions for calling Spark and MLlib from R
> --
>
> Key: SPARK-15129
> URL: https://issues.apache.org/jira/browse/SPARK-15129
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, SparkR
>Reporter: Joseph K. Bradley
>
> Since some R API modifications happened in 2.0, we need to make the new 
> standards clear in the user guide.






[jira] [Commented] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R

2016-05-12 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281698#comment-15281698
 ] 

Gayathri Murali commented on SPARK-15129:
-

I can work on this

> Clarify conventions for calling Spark and MLlib from R
> --
>
> Key: SPARK-15129
> URL: https://issues.apache.org/jira/browse/SPARK-15129
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, SparkR
>Reporter: Joseph K. Bradley
>
> Since some R API modifications happened in 2.0, we need to make the new 
> standards clear in the user guide.






[jira] [Commented] (SPARK-15201) Handle integer overflow correctly in hash code computation

2016-05-12 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281679#comment-15281679
 ] 

Gayathri Murali commented on SPARK-15201:
-

I can work on this 

> Handle integer overflow correctly in hash code computation
> --
>
> Key: SPARK-15201
> URL: https://issues.apache.org/jira/browse/SPARK-15201
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> The mult31AndAdd() function in utils.R does not handle integer overflow well. 
> We need to find a solution that exactly matches the hash code computation 
> algorithm for strings in the JDK.
> For details, refer to the discussion in 
> https://github.com/apache/spark/pull/10436






[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2016-04-30 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265411#comment-15265411
 ] 

Gayathri Murali commented on SPARK-15041:
-

I can work on this

> adding mode strategy for ml.feature.Imputer for categorical features
> 
>
> Key: SPARK-15041
> URL: https://issues.apache.org/jira/browse/SPARK-15041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Adding a mode strategy to ml.feature.Imputer for categorical features. This 
> needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601
> From comments of jkbradley and Nick Pentreath in the PR
> {quote}
> Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
> approaches such as frequentItems or Count-Min Sketch (will require an update 
> to CMS to return "heavy-hitters").
> investigate if we can use metadata to only allow mode for categorical 
> features (or perhaps as an easier alternative, allow mode for only Int/Long 
> columns)
> {quote}






[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257006#comment-15257006
 ] 

Gayathri Murali commented on SPARK-14894:
-

[~wangmiao1981] I have a PR ready for this. If you are okay with that, I can go 
ahead and submit it.

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.






[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253227#comment-15253227
 ] 

Gayathri Murali commented on SPARK-14314:
-

[~mengxr] Yes.

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252259#comment-15252259
 ] 

Gayathri Murali commented on SPARK-14314:
-

I am working on this


> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14315) GLMs model persistence in SparkR

2016-04-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252260#comment-15252260
 ] 

Gayathri Murali commented on SPARK-14315:
-

I am working on this

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14712) spark.ml LogisticRegressionModel.toString should summarize model

2016-04-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246600#comment-15246600
 ] 

Gayathri Murali commented on SPARK-14712:
-

{{__repr__}} is defined for LabeledPoint and LinearModel in mllib.regression, 
but not for LogisticRegressionModel. Would you like to add it for 
LogisticRegressionModel in both ml and mllib? 

> spark.ml LogisticRegressionModel.toString should summarize model
> 
>
> Key: SPARK-14712
> URL: https://issues.apache.org/jira/browse/SPARK-14712
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> spark.mllib LogisticRegressionModel overrides toString to print a little 
> model info.  We should do the same in spark.ml.  I'd recommend:
> * super.toString
> * numClasses
> * numFeatures
> We should also override {{__repr__}} in pyspark to do the same.
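A minimal sketch of the suggested override (illustrative, not the merged implementation):

{code}
// Sketch: toString combining the recommended pieces.
class ExampleModel(val numClasses: Int, val numFeatures: Int) {
  override def toString: String =
    s"${super.toString}: numClasses=$numClasses, numFeatures=$numFeatures"
}
{code}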






[jira] [Commented] (SPARK-14604) Modify design of ML model summaries

2016-04-18 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246489#comment-15246489
 ] 

Gayathri Murali commented on SPARK-14604:
-

[~josephkb] I see that LogisticRegression has an evaluate method. Would you like 
to add a similar one to LinearRegressionModel and GLM? Also, the 
LogisticRegression summary does not store the model, while the Linear and GLM 
summaries do. 

> Modify design of ML model summaries
> ---
>
> Key: SPARK-14604
> URL: https://issues.apache.org/jira/browse/SPARK-14604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Several spark.ml models now have summaries containing evaluation metrics and 
> training info:
> * LinearRegressionModel
> * LogisticRegressionModel
> * GeneralizedLinearRegressionModel
> These summaries have unfortunately been added in an inconsistent way.  I 
> propose to reorganize them to have:
> * For each model, 1 summary (without training info) and 1 training summary 
> (with info from training).  The non-training summary can be produced for a 
> new dataset via {{evaluate}}.
> * A summary should not store the model itself.
> * A summary should provide a transient reference to the dataset used to 
> produce the summary.
> This task will involve reorganizing the GLM summary (which lacks a 
> training/non-training distinction) and deprecating the model method in the 
> LinearRegressionSummary.






[jira] [Comment Edited] (SPARK-14503) spark.ml API for FPGrowth

2016-04-12 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237407#comment-15237407
 ] 

Gayathri Murali edited comment on SPARK-14503 at 4/12/16 3:44 PM:
--

[~josephkb] : [~yuhaoyan] and I can work on this. Will submit a design doc 
shortly


was (Author: gayathrimurali):
[~josephkb] [~yuhaoyan] and I can work on this. Will submit a design doc shortly

> spark.ml API for FPGrowth
> -
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.
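A rough sketch of what a DataFrame-based surface might look like, purely for the design discussion (column and parameter names are guesses, not the final API):

{code}
// Hypothetical shape: an Estimator-style FPGrowth reading an array
// column of items per transaction. Names are illustrative only.
import org.apache.spark.sql.DataFrame

class FPGrowthSketch {
  var minSupport: Double = 0.3
  var itemsCol: String = "items"
  def fit(df: DataFrame): Unit = {
    // mine frequent itemsets from df(itemsCol); a real API would return a Model
  }
}
{code}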






[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth

2016-04-12 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237407#comment-15237407
 ] 

Gayathri Murali commented on SPARK-14503:
-

[~josephkb] [~yuhaoyan] and I can work on this. Will submit a design doc shortly

> spark.ml API for FPGrowth
> -
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.






[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs

2016-03-25 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212442#comment-15212442
 ] 

Gayathri Murali commented on SPARK-13783:
-

Thanks [~josephkb]. I can go first, as I am almost done making changes. I could 
definitely review [~yanboliang]'s code and would really appreciate the same 
help. 

> Model export/import for spark.ml: GBTs
> --
>
> Key: SPARK-13783
> URL: https://issues.apache.org/jira/browse/SPARK-13783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for both GBTClassifier and GBTRegressor.  The implementation 
> should reuse the one for DecisionTree*.






[jira] [Commented] (SPARK-13783) Model export/import for spark.ml: GBTs

2016-03-24 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211407#comment-15211407
 ] 

Gayathri Murali commented on SPARK-13783:
-

[~yanboliang] I am working on Random Forest and I have similar options for 
discussion. One more suggestion here:

1. Store all trees in a single DataFrame, saved to a single Parquet file, with 
a treeID added to the node data and tree reconstruction done using the 
pre-order approach from DecisionTree. Does this approach work? 

> Model export/import for spark.ml: GBTs
> --
>
> Key: SPARK-13783
> URL: https://issues.apache.org/jira/browse/SPARK-13783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for both GBTClassifier and GBTRegressor.  The implementation 
> should reuse the one for DecisionTree*.






[jira] [Comment Edited] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207219#comment-15207219
 ] 

Gayathri Murali edited comment on SPARK-13733 at 3/22/16 8:28 PM:
--

[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

1. For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?
2. What is the idea behind having an initial weight distribution for other 
vertices in personalized page rank? 





was (Author: gayathrimurali):
[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?




> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.
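To make the question concrete, a small sketch of seeding from a distribution instead of a single vertex (plain Scala, not the GraphX implementation):

{code}
// Sketch: initial scores from a caller-supplied preference distribution.
// Single-source personalized PageRank is the special case prefs = Map(src -> 1.0).
def initialScores(vertexIds: Seq[Long], prefs: Map[Long, Double]): Map[Long, Double] =
  vertexIds.map(id => id -> prefs.getOrElse(id, 0.0)).toMap

// initialScores(Seq(1L, 2L, 3L), Map(1L -> 0.5, 2L -> 0.5))
//   => Map(1 -> 0.5, 2 -> 0.5, 3 -> 0.0)
{code}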






[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207219#comment-15207219
 ] 

Gayathri Murali commented on SPARK-13733:
-

[~mengxr] [~dwmclary] https://issues.apache.org/jira/browse/SPARK-5854 - 
mentions one key difference between page rank and personalized page rank as:
"In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.", which basically means we initialize as [1, 0, 0, 0, ..] where we 
set 1 only to the seed node. 

For this JIRA, do we want to instead set an initial distribution of weights 
such that all nodes receive non-zero initial values?  Could you clarify if this 
is the behavior that is intended?




> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.






[jira] [Issue Comment Deleted] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-22 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-13733:

Comment: was deleted

(was: [~mengxr] Should the rest of the vertices also be set to resetProb (which 
is a 0.25 initial weight)?)

> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.






[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-19 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198327#comment-15198327
 ] 

Gayathri Murali commented on SPARK-13733:
-

[~mengxr] Should the rest of the vertices also be set to resetProb (which is a 
0.25 initial weight)?

> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.






[jira] [Commented] (SPARK-13949) PySpark ml DecisionTreeClassifier, Regressor support export/import

2016-03-19 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198240#comment-15198240
 ] 

Gayathri Murali commented on SPARK-13949:
-

I can work on this

> PySpark ml DecisionTreeClassifier, Regressor support export/import
> --
>
> Key: SPARK-13949
> URL: https://issues.apache.org/jira/browse/SPARK-13949
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>  Labels: starter
>
> Follow examples from other tasks in parent JIRA: Make the tree Estimator, 
> Model be MLReadable, MLWritable, and add doctests.






[jira] [Commented] (SPARK-13784) Model export/import for spark.ml: RandomForests

2016-03-13 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192537#comment-15192537
 ] 

Gayathri Murali commented on SPARK-13784:
-

I can work on this, if no one else has started

> Model export/import for spark.ml: RandomForests
> ---
>
> Key: SPARK-13784
> URL: https://issues.apache.org/jira/browse/SPARK-13784
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for both RandomForestClassifier and RandomForestRegressor.  The 
> implementation should reuse the one for DecisionTree*.
> It should augment NodeData with a tree ID so that all nodes can be stored in 
> a single DataFrame.  It should reconstruct the trees in a distributed fashion.
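A sketch of the NodeData augmentation described above (fields abbreviated; names are illustrative, loosely following the DecisionTree persistence code):

{code}
// Sketch: tag each node row with its tree so a whole forest fits in
// one DataFrame / one Parquet file.
case class NodeData(id: Int, prediction: Double, leftChild: Int, rightChild: Int)
case class EnsembleNodeData(treeID: Int, nodeData: NodeData)
{code}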






[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-03-08 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186074#comment-15186074
 ] 

Gayathri Murali commented on SPARK-12664:
-

[~yanboliang] Are you working on this? If not, can I work on it?

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 






[jira] [Commented] (SPARK-13733) Support initial weight distribution in personalized PageRank

2016-03-08 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185945#comment-15185945
 ] 

Gayathri Murali commented on SPARK-13733:
-

I can work on this

> Support initial weight distribution in personalized PageRank
> 
>
> Key: SPARK-13733
> URL: https://issues.apache.org/jira/browse/SPARK-13733
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Xiangrui Meng
>
> It would be nice to support personalized PageRank with an initial weight 
> distribution besides a single vertex. It should be easy to modify the current 
> implementation to add this support.






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181819#comment-15181819
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] Can you list the steps to reproduce the bug? 

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attribute names to differ from the original ones.
> E.g., we want to learn the HouseVotes84 feature names "V1, V2, ..., V16", but 
> with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the transform 
> function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.






[jira] [Issue Comment Deleted] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-04 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-13073:

Comment: was deleted

(was: I can work on this, can you please assign it to me?)

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate 
> the trained model, tests such as the Wald test and the chi-square test should 
> be run, and their results summarized and displayed like R's GLM summary.






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181369#comment-15181369
 ] 

Gayathri Murali commented on SPARK-13641:
-

Yes, it looks intentional, to carry metadata. There are multiple ways in which 
it can be stripped out. I will let [~mengxr] answer whether it's OK to strip 
the "c_" from feature names in the SparkRWrapper.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attribute names to differ from the original ones.
> E.g., we want to learn the HouseVotes84 feature names "V1, V2, ..., V16", but 
> with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the transform 
> function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179241#comment-15179241
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] I can remove the c_ prefix on feature names from the VectorAssembler 
code, but is the naming convention intentional?

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads to output attributes that are not equal to the original ones.
> E.g., we want to learn the HouseVotes84 features' names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends suffixes to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179180#comment-15179180
 ] 

Gayathri Murali commented on SPARK-13629:
-

[~josephkb] In the case of discrete probabilistic models, where this is most 
applicable, would the min_df and max_df thresholds change, or should only the 
word count be set to 0 or 1 depending on the value of binary? 

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
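
A minimal sketch of the toggle's intended semantics, separate from whatever 
the final CountVectorizer API looks like:

{code}
// Raw term counts for one document; with binary = true every non-zero
// count collapses to 1.0 before the output vector is assembled.
val termCounts = Map("spark" -> 3L, "count" -> 1L, "vectorizer" -> 2L)
val binary = true
val values: Map[String, Double] =
  termCounts.map { case (term, c) => term -> (if (binary) 1.0 else c.toDouble) }
{code}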



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178517#comment-15178517
 ] 

Gayathri Murali commented on SPARK-13641:
-

I can work on this. Can you please assign it to me?


> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175006#comment-15175006
 ] 

Gayathri Murali commented on SPARK-13025:
-

https://github.com/apache/spark/pull/11459

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.
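
Hypothetical caller-side usage once the setter is public, assuming the 
existing private method keeps its name and that a previously trained model is 
in scope:

{code}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Warm-start training from an earlier fit. `setInitialModel` is private
// at the time of this thread; this shows what making it public enables.
val warmStart: LogisticRegressionModel = ???  // a previously trained model
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setInitialModel(warmStart)
// val model = lr.fit(training)
{code}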



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-13025:

Comment: was deleted

(was: PR : https://github.com/apache/spark/pull/11458)

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174983#comment-15174983
 ] 

Gayathri Murali commented on SPARK-13025:
-

PR : https://github.com/apache/spark/pull/11458

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174943#comment-15174943
 ] 

Gayathri Murali commented on SPARK-13073:
-

I can work on this. Can you please assign it to me?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate 
> the trained model, tests such as the Wald test and chi-square test should be 
> run, and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:57 AM:


[~josephkb] How should the test stat results be persisted? 

Option 1: persist the test stats as a data member of the ChiSqSelectorModel 
class, in which case they need to be initialized via the constructor. This 
would mean modifying the code wherever an object of the class is instantiated. 

Option 2: create an auxiliary class that just stores the test stat results. 


was (Author: gayathrimurali):
[~josephkb] Should the test statistics result be stored as a text/parquet file? 
or Can it just be stored in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.
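
A minimal sketch of Option 1 from the comment above, assuming the per-feature 
chi-squared results are carried on the model; the extra constructor parameter 
is the proposal, not current API:

{code}
import org.apache.spark.mllib.stat.test.ChiSqTestResult

// Keep the test statistics alongside the selected indices so callers
// can inspect why each feature was kept.
class ChiSqSelectorModelWithStats(
    val selectedFeatures: Array[Int],
    val featureStats: Array[ChiSqTestResult])
{code}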



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:53 AM:


[~josephkb] Should the test statistics be stored as a text/Parquet file, or 
can they just be kept in a local array? 


was (Author: gayathrimurali):
[~josephkb] Should the test statistics result be stored as a text/parquet file? 
or Can it just be stored in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali commented on SPARK-6160:


[~josephkb] Should the test statistics be stored as a text/Parquet file, or 
can they just be kept in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-24 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166419#comment-15166419
 ] 

Gayathri Murali commented on SPARK-6160:


Is anyone working on this? If not, I can.

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-02-24 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163925#comment-15163925
 ] 

Gayathri Murali commented on SPARK-13396:
-

I can work on this

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead
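
A hedged sketch of the substitution, based only on the deprecation message 
above; the exact field types on that branch are not verified here:

{code}
import org.apache.spark.{ExceptionFailure, TaskEndReason}

// Read the accumulator updates off the failure instead of the
// deprecated `metrics` field.
def failureAccumUpdates(reason: TaskEndReason): Seq[Any] = reason match {
  case e: ExceptionFailure => e.accumUpdates
  case _                   => Seq.empty
}
{code}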



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13174) Add API and options for csv data sources

2016-02-09 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139900#comment-15139900
 ] 

Gayathri Murali commented on SPARK-13174:
-

There is already a way to read CSV files by specifying the delimiter. Can you 
elaborate a bit more on the component that needs this feature? 

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Reporter: Davies Liu
>
> We should have a API to load csv data source (with some options as 
> arguments), similar to json() and jdbc()
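
For concreteness, a sketch of what such an API could look like by analogy 
with json() and jdbc(); the csv(...) method and the option names shown are 
the proposal, not existing API at the time of this thread:

{code}
import org.apache.spark.sql.SQLContext

// Hypothetical reader call, mirroring sqlContext.read.json(path).
def loadCsv(sqlContext: SQLContext, path: String) =
  sqlContext.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv(path)
{code}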



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12729) phantom references to replace the finalize call in python broadcast

2016-02-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127164#comment-15127164
 ] 

Gayathri Murali edited comment on SPARK-12729 at 2/1/16 11:28 PM:
--

I can work on this if no one else is. Can you specify the affected Spark version? 


was (Author: gayathrimurali):
I can work on this, if no one else is. 

> phantom references to replace the finalize call in python broadcast
> ---
>
> Key: SPARK-12729
> URL: https://issues.apache.org/jira/browse/SPARK-12729
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> it is doing IO operations and blocking the GC thread, 
> see 
> http://resources.ej-technologies.com/jprofiler/help/doc/index.html#jprofiler.helptopics.cpu.finalizers
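
The generic phantom-reference pattern the title refers to, sketched in Scala 
with illustrative names (not Spark internals): a dedicated thread drains the 
reference queue and performs the IO-heavy cleanup, instead of overriding 
finalize() and stalling the GC thread:

{code}
import java.lang.ref.{PhantomReference, ReferenceQueue}

val queue = new ReferenceQueue[AnyRef]()
val resource = new AnyRef
// Keep the reference alive so the queue is notified on collection.
val ref = new PhantomReference[AnyRef](resource, queue)

new Thread(new Runnable {
  def run(): Unit = {
    queue.remove()  // blocks until `resource` has been collected
    // ... delete the broadcast's temp files here ...
  }
}).start()
{code}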



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12729) phantom references to replace the finalize call in python broadcast

2016-02-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127164#comment-15127164
 ] 

Gayathri Murali commented on SPARK-12729:
-

I can work on this, if no one else is. 

> phantom references to replace the finalize call in python broadcast
> ---
>
> Key: SPARK-12729
> URL: https://issues.apache.org/jira/browse/SPARK-12729
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> it is doing IO operations and blocking the GC thread, 
> see 
> http://resources.ej-technologies.com/jprofiler/help/doc/index.html#jprofiler.helptopics.cpu.finalizers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType

2015-12-21 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066743#comment-15066743
 ] 

Gayathri Murali commented on SPARK-12436:
-

I can work on this

> If all values of a JSON field is null, JSON's inferSchema should return 
> NullType instead of StringType
> --
>
> Key: SPARK-12436
> URL: https://issues.apache.org/jira/browse/SPARK-12436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that 
> always has null values or an {{ArrayType(StringType)}}  for a field that 
> always has empty array values. Although this behavior makes writing JSON data 
> to other data sources easy (i.e. when writing data, we do not need to remove 
> those {{NullType}} or {{ArrayType(NullType)}} columns), it makes downstream 
> application hard to reason about the actual schema of the data and thus makes 
> schema merging hard. We should allow JSON's inferSchema returns {{NullType}} 
> and {{ArrayType(NullType)}}. Also, we need to make sure that when we write 
> data out, we should remove those {{NullType}} or {{ArrayType(NullType)}} 
> columns first. 
> Besides  {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same 
> thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). 
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema returns {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and 
> {{ArrayType(NullType)}} columns from the data that will be write out for all 
> data sources (i.e. data sources based our data source API and Hive tables). 
> Or, we should just add this operation for certain data sources (e.g. 
> Parquet). For example, we may not need this operation for Hive because Hive 
> has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
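
An illustrative sketch of the proposed behavior using the existing type 
objects (not the inference code itself):

{code}
import org.apache.spark.sql.types._

// An always-null field would infer as NullType instead of StringType...
val inferred = StructType(Seq(
  StructField("id", LongType),
  StructField("alwaysNull", NullType)))

// ...and NullType columns would be dropped before writing data out.
val writable = StructType(inferred.filterNot(_.dataType == NullType))
{code}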



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5818) unable to use "add jar" in hql

2015-10-20 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966065#comment-14966065
 ] 

Gayathri Murali commented on SPARK-5818:


I am trying to use the ADD JAR command on Spark 1.5 and am running into 
issues. The jar is registered with the local Maven repo. 

hivecontext.sql("ADD JAR 
'/Users/x/.m2/repository/org/apache/hadoop/hive/serde2/TestSerDe/1.0/TestSerDe-1.0.jar'")

This path is correct and the jar file is present at that path, but I am 
getting the following error:

ERROR SparkContext: Jar not found at 
'/Users/x/.m2/repository/org/apache/hadoop/hive/serde2/TestSerDe/1.0/TestSerDe-1.0.jar'

However, when I add this jar with --jars during spark-submit, it works fine. 




> unable to use "add jar" in hql
> --
>
> Key: SPARK-5818
> URL: https://issues.apache.org/jira/browse/SPARK-5818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>
> In the spark 1.2.1 and 1.2.0, it's unable the use the hive command "add jar"  
> in hql.
> It seems that the problem in spark-2219 is still existed.
> the problem can be reproduced as described in the below. Suppose the jar file 
> is named brickhouse-0.6.0.jar and is placed in the /tmp directory
> {code}
> spark-shell>import org.apache.spark.sql.hive._
> spark-shell>val sqlContext = new HiveContext(sc)
> spark-shell>import sqlContext._
> spark-shell>hql("add jar /tmp/brickhouse-0.6.0.jar")
> {code}
> the error message is showed as blow
> {code:title=Error Log}
> 15/02/15 01:36:31 ERROR SessionState: Unable to register 
> /tmp/brickhouse-0.6.0.jar
> Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
> cast to java.net.URLClassLoader
> java.lang.ClassCastException: 
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
> java.net.URLClassLoader
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
>   at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
>   at 
> org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
>   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
>   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
>   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
>   at 
> $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line30.$read$$iwC$$iwC$$iwC$$iwC.(:39)
>   at $line30.$read$$iwC$$iwC$$iwC.(:41)
>   at $line30.$read$$iwC$$iwC.(:43)
>   at $line30.$read$$iwC.(:45)
>   at $line30.$read.(:47)
>   at $line30.$read$.(:51)
>   at $line30.$read$.()
>   at $line30.$eval$.(:7)
>   at $line30.$eval$.()
>   at $line30.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRu

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946026#comment-14946026
 ] 

Gayathri Murali commented on SPARK-10954:
-

I can work on this


> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> -
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5, and then check the metadata with {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> -
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> -
> dec:  FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> writing Parquet files using parquet-mr 1.7.0 directly without Spark 1.5 only 
> shows {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine though.
> Currently this isn't a big issue, but parquet-mr 1.8 checks this field to 
> work around PARQUET-251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle

2015-10-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940133#comment-14940133
 ] 

Gayathri Murali commented on SPARK-7448:


Is anyone working on this? If not, I would like to work on it. 

> Implement custom byte array serializer for use in PySpark shuffle
> 
>
> Key: SPARK-7448
> URL: https://issues.apache.org/jira/browse/SPARK-7448
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Shuffle
>Reporter: Josh Rosen
>Priority: Minor
>
> PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We 
> should implement a custom Serializer for use in these shuffles.  This will 
> allow us to take advantage of shuffle optimizations like SPARK-7311 for 
> PySpark without requiring users to change the default serializer to 
> KryoSerializer (this is useful for JobServer-type applications).
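
A minimal sketch of the specialization (illustrative names, not the eventual 
Spark class): byte arrays serialize as length plus raw bytes, with none of 
the generic object-serialization overhead:

{code}
import java.io.{DataInputStream, DataOutputStream}

def writeByteArray(out: DataOutputStream, bytes: Array[Byte]): Unit = {
  out.writeInt(bytes.length)
  out.write(bytes)
}

def readByteArray(in: DataInputStream): Array[Byte] = {
  val buf = new Array[Byte](in.readInt())
  in.readFully(buf)
  buf
}
{code}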



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7218) Create a real iterator with open/close for Spark SQL

2015-10-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940116#comment-14940116
 ] 

Gayathri Murali commented on SPARK-7218:


Is anyone working on this? If not, I would like to work on it. Can someone 
point me to a bit more detail on this JIRA?

> Create a real iterator with open/close for Spark SQL
> 
>
> Key: SPARK-7218
> URL: https://issues.apache.org/jira/browse/SPARK-7218
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9656) Add missing methods to linalg.distributed

2015-10-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940088#comment-14940088
 ] 

Gayathri Murali commented on SPARK-9656:


[~mwdus...@us.ibm.com] Are you working on this? I would like to work on some of 
these functions if you haven't started already.  

> Add missing methods to linalg.distributed
> -
>
> Key: SPARK-9656
> URL: https://issues.apache.org/jira/browse/SPARK-9656
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>
> Missing methods in linalg.distributed
> RowMatrix
> 1. computeGramianMatrix
> 2. computeCovariance
> 3. computeColumnSummaryStatistics
> 4. columnSimilarities
> 5. tallSkinnyQR
> IndexedRowMatrix
> 1. computeGramianMatrix()
> CoordinateMatrix
> 1. transpose()
> BlockMatrix (these may be able to be rolled into 6488)
> 1. validate()
> 2. cache()
> 3. persist()
> 4. transpose()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-28 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933597#comment-14933597
 ] 

Gayathri Murali commented on SPARK-10688:
-

Sure, will do

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-24 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906583#comment-14906583
 ] 

Gayathri Murali commented on SPARK-10688:
-

I started working on it as well. Since there isn't a way to ensure exclusivity, 
we should go with first come, first served. 

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-22 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903357#comment-14903357
 ] 

Gayathri Murali commented on SPARK-10688:
-

If there isn't anyone working on it, I would like to work on this.

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-09-07 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734070#comment-14734070
 ] 

Gayathri Murali commented on SPARK-7425:


Has this been fixed? If not, I am starting out and would like to send a PR for 
this one.

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
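
A rough approximation of the relaxed check (the real fix would live in 
PredictorParams.validateAndTransformSchema; this helper only sketches the 
idea):

{code}
import org.apache.spark.sql.types._

// Accept any NumericType for the label instead of requiring DoubleType
// outright; the value would be cast to Double when fitting.
def checkLabelColumn(schema: StructType, labelCol: String): Unit =
  schema(labelCol).dataType match {
    case _: NumericType => ()
    case other => throw new IllegalArgumentException(
      s"Label column $labelCol must be numeric, but was $other.")
  }
{code}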



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10380) Confusing examples in pyspark SQL docs

2015-09-07 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733993#comment-14733993
 ] 

Gayathri Murali commented on SPARK-10380:
-

Is anyone working on this already? If not, I would like to send a pull request.

> Confusing examples in pyspark SQL docs
> --
>
> Key: SPARK-10380
> URL: https://issues.apache.org/jira/browse/SPARK-10380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>  Labels: docs, starter
>
> There’s an error in the astype() documentation, as it uses cast instead of 
> astype. It should probably include a mention that astype is an alias for cast 
> (and vice versa in the cast documentation): 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.astype
>  
> The same error occurs with drop_duplicates and dropDuplicates: 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates
>  
> The issue here is we are copying the code.  According to [~davies] the 
> easiest way is to copy the method and just add new docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org