[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2017-04-18 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972474#comment-15972474
 ] 

Mohamed Baddar commented on SPARK-1548:
---

[~srowen] [~josephkb] Any updates on the possibility of proceeding with this 
issue?

> Add Partial Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1548
> URL: https://issues.apache.org/jira/browse/SPARK-1548
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>
> This task involves creating an alternate approximate random forest 
> implementation where each tree is constructed per partition.
> The task involves:
> - Justifying with theory and experimental results why this algorithm is a 
> good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2017-03-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940130#comment-15940130
 ] 

Mohamed Baddar commented on SPARK-1548:
---

[~manishamde] [~sowen] [~josephkb] 
I have some experience contributing to starter tasks in Spark, and I find this 
issue interesting. While investigating the partial implementation of random 
forests, I found these resources:

https://mahout.apache.org/users/classification/partial-implementation.html
https://github.com/apache/mahout/blob/b5fe4aab22e7867ae057a6cdb1610cfa17555311/mr/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/package-info.java

I think analyzing the Mahout implementation provides a good basis for studying 
a partial random forest implementation, both theoretically and practically. If 
this issue is still important to Spark, it would be great if I could start on 
it, beginning with an analysis document on the current Mahout implementation to 
assess its performance.

> Add Partial Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1548
> URL: https://issues.apache.org/jira/browse/SPARK-1548
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>
> This task involves creating an alternate approximate random forest 
> implementation where each tree is constructed per partition.
> The task involves:
> - Justifying with theory and experimental results why this algorithm is a 
> good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2017-03-21 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934864#comment-15934864
 ] 

Mohamed Baddar commented on SPARK-7129:
---

[~josephkb] [~sethah] [~meihuawu] [~mlnick] If no one is working on this, can I 
start working on it? I have some experience contributing to starter tasks in 
Spark. I would love to start by reading the design docs mentioned in the 
comments and discussing next steps.


> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2016-07-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391082#comment-15391082
 ] 

Mohamed Baddar commented on SPARK-3246:
---

[~sheridanrawlins] I will be working on this soon, most probably starting on the 
1st of August.

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options such as undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-05-03 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268400#comment-15268400
 ] 

Mohamed Baddar commented on SPARK-13073:


[~samsudhin] I will work on this soon.

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes

2016-04-25 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256247#comment-15256247
 ] 

Mohamed Baddar commented on SPARK-14077:


I have suspended working on this for the time being.

> Support weighted instances in naive Bayes
> -
>
> Key: SPARK-14077
> URL: https://issues.apache.org/jira/browse/SPARK-14077
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>  Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice, 
> people may have the frequency table instead. It is useful for us to support 
> instance weights to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes

2016-03-27 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213481#comment-15213481
 ] 

Mohamed Baddar commented on SPARK-14077:


[~mengxr] [~josephkb] In the scikit-learn code, the same feature is implemented 
by scaling the target variable after binarization. Here is the source code link: 
https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/naive_bayes.py#L507.
I think we can follow the scikit-learn implementation as a guideline, and it 
will also help with the unit tests. Any thoughts?

> Support weighted instances in naive Bayes
> -
>
> Key: SPARK-14077
> URL: https://issues.apache.org/jira/browse/SPARK-14077
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>  Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice, 
> people may have the frequency table instead. It is useful for us to support 
> instance weights to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210259#comment-15210259
 ] 

Mohamed Baddar commented on SPARK-13073:


Thanks [~samsudhin], I noticed the difference in params. Do you have any other 
comments on my notes?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes

2016-03-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210069#comment-15210069
 ] 

Mohamed Baddar commented on SPARK-14077:


[~mengxr] If nobody is working on this task, can I work on it?

> Support weighted instances in naive Bayes
> -
>
> Key: SPARK-14077
> URL: https://issues.apache.org/jira/browse/SPARK-14077
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>  Labels: naive-bayes
>
> In naive Bayes, we expect inputs to be individual observations. In practice, 
> people may have the frequency table instead. It is useful for us to support 
> instance weights to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210054#comment-15210054
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] Can one of the admins verify this PR?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1359) SGD implementation is not efficient

2016-03-16 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197062#comment-15197062
 ] 

Mohamed Baddar commented on SPARK-1359:
---

[~mengxr] If this issue is still of interest and nobody is working on it, I can 
start on the implementation.

> SGD implementation is not efficient
> ---
>
> Key: SPARK-1359
> URL: https://issues.apache.org/jira/browse/SPARK-1359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Xiangrui Meng
>
> The SGD implementation samples a mini-batch to compute the stochastic 
> gradient. This is not efficient because examples are provided via an iterator 
> interface. We have to scan all of them to obtain a sample.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2016-03-15 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195242#comment-15195242
 ] 

Mohamed Baddar commented on SPARK-3246:
---

[~josephkb] If nobody is working on this issue and it is still of interest, I 
can work on it.

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options such as undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9134) LDA Asymmetric topic-word prior

2016-03-15 Thread Mohamed Baddar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamed Baddar updated SPARK-9134:
--
Comment: was deleted

(was: [~josephkb] [~fliang] If nobody is working on this and there is still 
interest in this issue, can I start working on it?)

> LDA Asymmetric topic-word prior
> ---
>
> Key: SPARK-9134
> URL: https://issues.apache.org/jira/browse/SPARK-9134
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>
> SPARK-8536 generalizes LDA to asymmetric document-topic priors, which 
> [Wallach et al|http://dirichlet.net/pdf/wallach09rethinking.pdf] argue offer 
> greater utility than symmetric priors.
> However, [Stanford 
> NLP|http://nlp.stanford.edu/software/tmt/tmt-0.2/scaladocs/scaladocs/edu/stanford/nlp/tmt/lda/LDA.html]
>  also permits an asymmetric topic-word prior. We should not support manually 
> specifying the entire matrix (which has numTopics * vocabSize entries); 
> rather, we should follow Stanford NLP and take a single vector of length 
> vocabSize as the prior over words, assuming that all topics share this prior 
> (e.g. replicate it numTopics times to get the topic-word prior matrix).
> We are leaving this as a TODO; any users who need this feature should discuss 
> it on this JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9134) LDA Asymmetric topic-word prior

2016-03-15 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195217#comment-15195217
 ] 

Mohamed Baddar commented on SPARK-9134:
---

[~josephkb] [~fliang] If nobody is working on this and there is still interest 
in this issue, can I start working on it?

> LDA Asymmetric topic-word prior
> ---
>
> Key: SPARK-9134
> URL: https://issues.apache.org/jira/browse/SPARK-9134
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>
> SPARK-8536 generalizes LDA to asymmetric document-topic priors, which 
> [Wallach et al|http://dirichlet.net/pdf/wallach09rethinking.pdf] argue offer 
> greater utility than symmetric priors.
> However, [Stanford 
> NLP|http://nlp.stanford.edu/software/tmt/tmt-0.2/scaladocs/scaladocs/edu/stanford/nlp/tmt/lda/LDA.html]
>  also permits an asymmetric topic-word prior. We should not support manually 
> specifying the entire matrix (which has numTopics * vocabSize entries); 
> rather, we should follow Stanford NLP and take a single vector of length 
> vocabSize as the prior over words, assuming that all topics share this prior 
> (e.g. replicate it numTopics times to get the topic-word prior matrix).
> We are leaving this as a TODO; any users who need this feature should discuss 
> it on this JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-13 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192290#comment-15192290
 ] 

Mohamed Baddar edited comment on SPARK-13073 at 3/13/16 11:03 AM:
--

[~josephkb] After further investigation of the code, and in order to keep the 
changes minimal, my previous suggestion may not be suitable. I think we can 
instead implement a toString version for BinaryLogisticRegressionSummary that 
gives different information than R's summary. It would create a string 
representation of the following members:
precision
recall
fMeasure
Is there any comment before I start the PR?


was (Author: mbaddar1):
[~josephkb] After more investigation in the code , and to make minimal changes 
on the code.My previous suggestion may not be suitable .I think we can 
implement toString version for BinaryLogisticRegressionSummary that give 
different information than R summary. It will create string representation for 
the following members :
precision 
recall
fmeasure
[~josephkb] is there any comment before i start the PR ?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-13 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192290#comment-15192290
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] After further investigation of the code, and in order to keep the 
changes minimal, my previous suggestion may not be suitable. I think we can 
instead implement a toString version for BinaryLogisticRegressionSummary that 
gives different information than R's summary. It would create a string 
representation of the following members:
precision
recall
fMeasure
[~josephkb] Is there any comment before I start the PR?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-10 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189283#comment-15189283
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] After looking at the source code of 
org.apache.spark.ml.classification.LogisticRegressionSummary and 
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary,

and after running a sample GLM in R, which produces the following output:

Call:
glm(formula = mpg ~ wt + hp + gear, family = gaussian(), data = mtcars)

Deviance Residuals: 
    Min      1Q  Median      3Q     Max  
-3.3712 -1.9017 -0.3444  0.9883  6.0655  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.013657   4.632264   6.911 1.64e-07 ***
wt          -3.197811   0.846546  -3.777 0.000761 ***
hp          -0.036786   0.009891  -3.719 0.000888 ***
gear         1.019981   0.851408   1.198 0.240963
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 6.626347)

    Null deviance: 1126.05  on 31  degrees of freedom
Residual deviance:  185.54  on 28  degrees of freedom
AIC: 157.05

Number of Fisher Scoring iterations: 2

I have the following comments:
1. I think we should add the following members to LogisticRegressionSummary: 
coefficients and residuals.
2. The toString method should be overridden in 
org.apache.spark.ml.classification.BinaryLogisticRegressionSummary and 
org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.

Any other suggestions? Please correct me if I have missed something.

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-09 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187935#comment-15187935
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] Can you assign this to me as a starter task?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-09 Thread Mohamed Baddar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamed Baddar updated SPARK-13073:
---
Comment: was deleted

(was: [~josephkb] If nobody is working on it, can I start working on this issue 
as a starter task?)

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-07 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183575#comment-15183575
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] If nobody is working on it, can I start working on this issue as a 
starter task?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and chi-square test 
> should be run and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-10-13 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955408#comment-14955408
 ] 

Mohamed Baddar commented on SPARK-10791:


[~aspa] Would you please point out the specific thread in 
https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser
that discusses this performance issue, as I am working on [SPARK-10808]?

> Optimize MLlib LDA topic distribution query performance
> ---
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
>Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size 
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas training with Vowpal 
> Wabbit on the same data and system setup took ~5 minutes. 
> Loading the persisted model from disk (~2 minutes), as well as querying LDA 
> model topic distributions (~4 seconds for one document) are also quite slow 
> operations.
> Our application is querying LDA model topic distribution (for one doc at a 
> time) as part of end-user operation execution flow, so a ~4 second execution 
> time is very problematic.
> The log includes the following message, which AFAIK, should mean that 
> netlib-java is using machine optimised native implementation: 
> "com.github.fommil.jni.JniLoader - successfully loaded 
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable 
> change in training performance. Model loading time was reduced to ~ 5 seconds 
> from ~ 2 minutes (now persisted as LocalLDAModel). However, query / 
> prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling for my LDA prototype code that requests topic 
> distributions from a model. According to Java Mission Control more than 80 % 
> of execution time during sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50;
> 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
> ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
> seems to take about 4 seconds to execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA

2015-09-25 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908198#comment-14908198
 ] 

Mohamed Baddar commented on SPARK-10808:


Thanks [~josephkb], working on it.

> LDA user guide: discuss running time of LDA
> ---
>
> Key: SPARK-10808
> URL: https://issues.apache.org/jira/browse/SPARK-10808
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and 
> communication complexity of LDA and its optimizers in the MLlib Programming 
> Guide.  E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of # topics * 
> (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of # 
> topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up.  It's often fine to 
> eliminate uncommon words, unless you are trying to create a very large number 
> of topics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA

2015-09-24 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906975#comment-14906975
 ] 

Mohamed Baddar commented on SPARK-10808:


Hello [~josephkb], can I take this task? Thanks.

> LDA user guide: discuss running time of LDA
> ---
>
> Key: SPARK-10808
> URL: https://issues.apache.org/jira/browse/SPARK-10808
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and 
> communication complexity of LDA and its optimizers in the MLlib Programming 
> Guide.  E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of # topics * 
> (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of # 
> topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up.  It's often fine to 
> eliminate uncommon words, unless you are trying to create a very large number 
> of topics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-23 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904471#comment-14904471
 ] 

Mohamed Baddar commented on SPARK-9836:
---

Thanks a lot, I will try one of the starter tasks, but it seems they are all 
taken. If so, what should I do next?

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users 
> select the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9798) CrossValidatorModel Documentation Improvements

2015-09-23 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904467#comment-14904467
 ] 

Mohamed Baddar commented on SPARK-9798:
---

Hello rerngvit,
I am also new to contributing. Could we work together, or split this task into 
subtasks, so that we can both get involved?
Thanks

> CrossValidatorModel Documentation Improvements
> --
>
> Key: SPARK-9798
> URL: https://issues.apache.org/jira/browse/SPARK-9798
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> CrossValidatorModel's avgMetrics and bestModel need documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-23 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904471#comment-14904471
 ] 

Mohamed Baddar edited comment on SPARK-9836 at 9/23/15 8:39 PM:


Thanks a lot [~mengxr], I will try one of the starter tasks, but it seems they 
are all taken. If so, what should I do next?


was (Author: mbaddar):
Thanks a lot , i will try one of the starter tasks , but seems they are all 
taken , if so , what should i do next ?

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users 
> select the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2015-09-21 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900694#comment-14900694
 ] 

Mohamed Baddar commented on SPARK-9835:
---

Can I work on this issue?
Thanks

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-9834, we can implement iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large, e.g. more than 4096.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-09-21 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900678#comment-14900678
 ] 

Mohamed Baddar commented on SPARK-9836:
---

Hello, can I be assigned to this task?
Thanks

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users 
> select the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min       1Q   Median       3Q      Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org