[jira] [Commented] (SPARK-7008) An Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504114#comment-14504114
 ] 

zhengruifeng commented on SPARK-7008:
-

Thanks for this information!

 An Implement of Factorization Machine (LibFM)
 -

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
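
For readers unfamiliar with the model: an FM prediction combines a linear term with pairwise interactions whose weights are factorized. Below is a minimal sketch (not the patch's actual code) of the prediction function, using the O(k*n) reformulation from Rendle's paper cited above; all names are illustrative:

```scala
object FMSketch {
  // FM prediction: y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j.
  // The pairwise term is computed in O(k*n) via the identity
  //   sum_{i<j} <v_i, v_j> x_i x_j
  //     = 0.5 * sum_f ( (sum_i v(i)(f) x_i)^2 - sum_i (v(i)(f) x_i)^2 )
  def predict(x: Array[Double], w0: Double, w: Array[Double],
              v: Array[Array[Double]]): Double = {
    val linear = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    val k = v.head.length // number of latent factors per feature
    val pairwise = (0 until k).map { f =>
      val terms = x.indices.map(i => v(i)(f) * x(i))
      val s = terms.sum
      0.5 * (s * s - terms.map(t => t * t).sum)
    }.sum
    w0 + linear + pairwise
  }
}
```

With two features, one latent factor each, this reduces to w0 + w.x + v1*v2*x1*x2, which is easy to check by hand.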



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-21 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Summary: An implementation of Factorization Machine (LibFM)  (was: An 
Implement of Factorization Machine (LibFM))

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) An Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Description: 
An implement of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


  was:
An implementation of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


Summary: An Implement of Factorization Machine (LibFM)  (was: Implement 
of Factorization Machine (LibFM))

 An Implement of Factorization Machine (LibFM)
 -

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Description: 
An implementation of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


  was:
An implementation of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
FM work well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhengruifeng
  Labels: features

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Labels: features patch  (was: features)

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Affects Version/s: 1.3.2
   1.3.1

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Target Version/s: 1.3.0, 1.3.1, 1.3.2  (was: 1.3.0)

 Implement of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch

 An implementation of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Created] (SPARK-7008) Implement of Factorization Machine (LibFM)

2015-04-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-7008:
---

 Summary: Implement of Factorization Machine (LibFM)
 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhengruifeng


An implementation of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
FM work well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf







[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504596#comment-14504596
 ] 

zhengruifeng commented on SPARK-7008:
-

I had not considered the size of the model, because the problems I usually 
encounter have dimensionality below 10 million. For higher dimensionality, I 
think feature hashing may help to limit the number of features (not sure).
LibFM implements four training algorithms: SGD, adaptive SGD, ALS, and MCMC. I 
have only implemented SGD for regression, and I plan to implement SGD for 
binary classification.
In my opinion, SGD is sensitive to the learning rate: large values cause 
divergence, while small values lead to long training times.
When coding, I followed LibFM closely. There are only two differences: LibFM 
uses plain SGD, while I use the mini-batch SGD provided by MLlib; LibFM uses a 
constant learning rate, while I make it decrease with the square root of the 
iteration counter. So I expect its convergence to be similar to LibFM's SGD.
I am testing the library, and the results will be posted in a few days.
Thanks.
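
The decayed learning rate described above can be sketched as follows; the schedule eta_t = eta0 / sqrt(t) matches the description, while the mini-batch step is a hypothetical helper for illustration, not MLlib's actual API:

```scala
object SgdDecaySketch {
  // Decayed learning rate: eta_t = eta0 / sqrt(t), for iteration t = 1, 2, ...
  // e.g. stepSize(0.1, 4) == 0.05
  def stepSize(eta0: Double, t: Int): Double = eta0 / math.sqrt(t.toDouble)

  // One mini-batch SGD step for squared loss on a linear model, averaging
  // the gradient over the batch of (features, label) pairs.
  def step(w: Array[Double], batch: Seq[(Array[Double], Double)],
           eta: Double): Array[Double] = {
    val grad = Array.fill(w.length)(0.0)
    for ((x, y) <- batch) {
      val err = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
      for (i <- w.indices) grad(i) += err * x(i)
    }
    w.indices.map(i => w(i) - eta * grad(i) / batch.size).toArray
  }
}
```

The decay keeps early steps large enough to make progress while shrinking later steps, which is one common way to balance the divergence/slow-training trade-off mentioned above.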

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-24 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Description: 
An implementation of Factorization Machines based on Scala and Spark MLlib.
FM is a kind of machine learning algorithm for multi-linear regression, and is 
widely used for recommendation.
FM works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf


  was:
An implement of Factorization Machines based on Scala and Spark MLlib.
Factorization Machine is a kind of machine learning algorithm for multi-linear 
regression, and is widely used for recommendation.
Factorization Machines works well in recent years' recommendation competitions.

Ref:
http://libfm.org/
http://doi.acm.org/10.1145/2168752.2168771
http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implementation of Factorization Machines based on Scala and Spark MLlib.
 FM is a kind of machine learning algorithm for multi-linear regression, and 
 is widely used for recommendation.
 FM works well in recent years' recommendation competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-24 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110
 ] 

zhengruifeng edited comment on SPARK-7008 at 4/25/15 12:46 AM:
---

The convergence curves of binary classification are plotted in the attached 
FM_CR.xlsx:
https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 
is used, and both SGD and LBFGS are tested.

The package has been submitted to spark-packages.org: 
http://spark-packages.org/package/zhengruifeng/spark-libFM


was (Author: podongfeng):
The convergence curves of binary classification are plotted in the attached 
FM_CR.xlsx.
https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 
is used, and both SGD and LBFGS are tested.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-24 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110
 ] 

zhengruifeng edited comment on SPARK-7008 at 4/25/15 12:44 AM:
---

The convergence curves of binary classification are plotted in the attached 
FM_CR.xlsx:
https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 
is used, and both SGD and LBFGS are tested.


was (Author: podongfeng):
The convergence curves of binary classification are plotted in the attached 
FM_CR.xlsx.

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 
is used, and both SGD and LBFGS are tested.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-24 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110
 ] 

zhengruifeng commented on SPARK-7008:
-

The convergence curves of binary classification are plotted in the attached 
FM_CR.xlsx.

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 
is used, and both SGD and LBFGS are tested.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-24 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-7008:

Attachment: FM_CR.xlsx

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implement of Factorization Machines based on Scala and Spark MLlib.
 Factorization Machine is a kind of machine learning algorithm for 
 multi-linear regression, and is widely used for recommendation.
 Factorization Machines works well in recent years' recommendation 
 competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-04-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513780#comment-14513780
 ] 

zhengruifeng commented on SPARK-7008:
-

AdaGrad works pretty well in practice, but I think adding it to MLlib as a new 
Optimizer for general use should be a separate issue.
And in my humble opinion, it may be better to avoid binding new algorithms to a 
specific Optimizer.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implementation of Factorization Machines based on Scala and Spark MLlib.
 FM is a kind of machine learning algorithm for multi-linear regression, and 
 is widely used for recommendation.
 FM works well in recent years' recommendation competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Closed] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-05-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-7008.
---
Resolution: Fixed

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.3.2
Reporter: zhengruifeng
  Labels: features, patch
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implementation of Factorization Machines based on Scala and Spark MLlib.
 FM is a kind of machine learning algorithm for multi-linear regression, and 
 is widely used for recommendation.
 FM works well in recent years' recommendation competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Commented] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length

2015-11-08 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996174#comment-14996174
 ] 

zhengruifeng commented on SPARK-11585:
--

I have implemented it based on Apriori's rule-generation algorithm:
https://github.com/zhengruifeng/spark-rules

It is compatible with the fpm APIs:

import org.apache.spark.mllib.fpm._

val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()

val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)

val ar = new AprioriRules()
  .setMinConfidence(0.1)
  .setMaxConsequent(1)
  .setNumPartitions(10)
val results = ar.run(model.freqItemsets)



> AssociationRules should generate all association rules with consequents of 
> arbitrary length
> 
>
> Key: SPARK-11585
> URL: https://issues.apache.org/jira/browse/SPARK-11585
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>
> AssociationRules should generate all association rules with consequents of 
> arbitrary length, not just rules that have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...






[jira] [Comment Edited] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length

2015-11-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996174#comment-14996174
 ] 

zhengruifeng edited comment on SPARK-11585 at 11/9/15 8:11 AM:
---

I have implemented it based on Apriori's rule-generation algorithm:
https://github.com/zhengruifeng/spark-rules

It is compatible with the fpm APIs:

import org.apache.spark.mllib.fpm._

val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()

val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)

val ar = new AprioriRules()
  .setMinConfidence(0.1)
  .setMaxConsequent(15)
  .setNumPartitions(10)
val results = ar.run(model.freqItemsets)


and it outputs rule-generation information like this:
15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 312917
15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 707747
15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 1020253
15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 972225
15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 653749
15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 331038
15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 138490
15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0


was (Author: podongfeng):
I have implemented it based on Apriori's rule-generation algorithm:
https://github.com/zhengruifeng/spark-rules

It is compatible with the fpm APIs:

import org.apache.spark.mllib.fpm._

val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()

val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)

val ar = new AprioriRules()
  .setMinConfidence(0.1)
  .setMaxConsequent(1)
  .setNumPartitions(10)
val results = ar.run(model.freqItemsets)



> AssociationRules should generate all association rules with consequents of 
> arbitrary length
> 
>
> Key: SPARK-11585
> URL: https://issues.apache.org/jira/browse/SPARK-11585
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>
> AssociationRules should generate all association rules with consequents of 
> arbitrary length, not just rules that have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...






[jira] [Created] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length

2015-11-08 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-11585:


 Summary: AssociationRules should generate all association rules 
with consequents of arbitrary length
 Key: SPARK-11585
 URL: https://issues.apache.org/jira/browse/SPARK-11585
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: zhengruifeng


AssociationRules should generate all association rules with consequents of 
arbitrary length, not just rules that have a single item as the consequent.

Such as:
39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
...
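The requested behavior can be sketched in a few lines: given frequent itemsets with their support counts, every non-empty proper subset of an itemset may serve as the antecedent, and the remainder becomes a consequent of arbitrary length. A standalone Python sketch, not the proposed Spark API; `all_rules` and its inputs are hypothetical names:

```python
from itertools import combinations

def all_rules(freq_itemsets, min_conf):
    """Generate association rules with consequents of arbitrary length.

    freq_itemsets: dict mapping frozenset(itemset) -> support count,
    assumed downward-closed (every subset of a frequent itemset present).
    Returns a list of (antecedent, consequent, confidence) triples.
    """
    rules = []
    for itemset, support in freq_itemsets.items():
        if len(itemset) < 2:
            continue
        # every non-empty proper subset can serve as the antecedent;
        # the remainder is the consequent, which may hold several items
        for size in range(1, len(itemset)):
            for ant in combinations(sorted(itemset), size):
                antecedent = frozenset(ant)
                conf = support / freq_itemsets[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```

Spark's current AssociationRules only emits the `size == len(itemset) - 1` antecedents, i.e. single-item consequents; the loop over all subset sizes is the proposed generalization.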







[jira] [Updated] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length

2015-11-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-11585:
-
Attachment: rule-generation.pdf

Apriori's Rule Generation Algorithm

> AssociationRules should generate all association rules with consequents of 
> arbitrary length
> 
>
> Key: SPARK-11585
> URL: https://issues.apache.org/jira/browse/SPARK-11585
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
> Attachments: rule-generation.pdf
>
>
> AssociationRules should generate all association rules with consequents of 
> arbitrary length, not just rules with a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-07-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621830#comment-14621830
 ] 

zhengruifeng commented on SPARK-7008:
-

Yes, L-BFGS provides a faster convergence rate.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhengruifeng
  Labels: features
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implementation of Factorization Machines built on Scala and Spark MLlib.
 FM is a machine learning algorithm for multi-linear regression and is widely 
 used in recommendation.
 FM has performed well in recent recommendation competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
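For context, the model behind LibFM is Rendle's second-order factorization machine; a sketch of the standard formulation, taken from the Rendle (2010) paper linked above rather than from this patch:

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

where $w_0$, $\mathbf{w}$, and the factor vectors $\mathbf{v}_i \in \mathbb{R}^k$ are learned parameters; the factorized pairwise term is what lets FM estimate feature interactions even under extreme sparsity.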






[jira] [Created] (SPARK-15770) 'Experimental' annotation audit

2016-06-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15770:


 Summary: 'Experimental' annotation audit
 Key: SPARK-15770
 URL: https://issues.apache.org/jira/browse/SPARK-15770
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Trivial


1, remove comments {:: Experimental ::} for non-experimental API
2, add comments {:: Experimental ::} for experimental API






[jira] [Updated] (SPARK-15770) 'Experimental' annotation audit

2016-06-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15770:
-
Description: 
1, remove comments {{:: Experimental ::}} for non-experimental API
2, add comments {{:: Experimental ::}} for experimental API

  was:
1, remove comments {:: Experimental ::} for non-experimental API
2, add comments {:: Experimental ::} for experimental API


> 'Experimental' annotation audit
> ---
>
> Key: SPARK-15770
> URL: https://issues.apache.org/jira/browse/SPARK-15770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, remove comments {{:: Experimental ::}} for non-experimental API
> 2, add comments {{:: Experimental ::}} for experimental API






[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi

2016-06-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15770:
-
Description: 
1, remove comments {{:: Experimental ::}} for non-experimental API
2, add comments {{:: Experimental ::}} for experimental API
3, add comments {{:: Experimental ::}} for experimental API

  was:
1, remove comments {{:: Experimental ::}} for non-experimental API
2, add comments {{:: Experimental ::}} for experimental API


> annotation audit for Experimental and DeveloperApi
> --
>
> Key: SPARK-15770
> URL: https://issues.apache.org/jira/browse/SPARK-15770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, remove comments {{:: Experimental ::}} for non-experimental API
> 2, add comments {{:: Experimental ::}} for experimental API
> 3, add comments {{:: Experimental ::}} for experimental API






[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi

2016-06-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15770:
-
Description: 
1, remove comments {{:: Experimental ::}} for non-experimental API
2, add comments {{:: Experimental ::}} for experimental API
3, add comments {{:: DeveloperApi ::}} for developerApi API

  was:
1, remove comments {{:: Experimental ::}} for non-experimental API
2, add comments {{:: Experimental ::}} for experimental API
3, add comments {{:: Experimental ::}} for experimental API


> annotation audit for Experimental and DeveloperApi
> --
>
> Key: SPARK-15770
> URL: https://issues.apache.org/jira/browse/SPARK-15770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, remove comments {{:: Experimental ::}} for non-experimental API
> 2, add comments {{:: Experimental ::}} for experimental API
> 3, add comments {{:: DeveloperApi ::}} for developerApi API






[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi

2016-06-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15770:
-
Summary: annotation audit for Experimental and DeveloperApi  (was: 
'Experimental' annotation audit)

> annotation audit for Experimental and DeveloperApi
> --
>
> Key: SPARK-15770
> URL: https://issues.apache.org/jira/browse/SPARK-15770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, remove comments {{:: Experimental ::}} for non-experimental API
> 2, add comments {{:: Experimental ::}} for experimental API






[jira] [Comment Edited] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322308#comment-15322308
 ] 

zhengruifeng edited comment on SPARK-15823 at 6/9/16 10:20 AM:
---

{{MulticlassMetrics.confusionMatrix}} may need {{@property}} too, but I am not 
sure.
Others seem ok.


was (Author: podongfeng):
{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.
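The change is purely cosmetic on the Python side; a minimal sketch of the pattern (the class and its data layout here are hypothetical stand-ins, not the actual PySpark implementation):

```python
class MulticlassMetricsSketch:
    """Hypothetical stand-in for pyspark.mllib.evaluation.MulticlassMetrics."""

    def __init__(self, prediction_label_pairs):
        self._pairs = list(prediction_label_pairs)

    @property
    def accuracy(self):
        # decorated with @property, so callers write m.accuracy rather than
        # m.accuracy(), matching weightedPrecision, weightedRecall, etc.
        correct = sum(1 for p, l in self._pairs if p == l)
        return correct / len(self._pairs)
```

With the decorator, `accuracy` reads as an attribute, consistent with the other metrics on the class.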






[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322309#comment-15322309
 ] 

zhengruifeng commented on SPARK-15823:
--

{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.






[jira] [Issue Comment Deleted] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15823:
-
Comment: was deleted

(was: {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am 
not sure.
Others seem ok.)

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.






[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322308#comment-15322308
 ] 

zhengruifeng commented on SPARK-15823:
--

{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.






[jira] [Updated] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15823:
-
Summary: Add @property for 'accuracy' in MulticlassMetrics  (was: Add 
@property for 'property' in MulticlassMetrics)

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep in step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc.






[jira] [Created] (SPARK-15823) Add @property for 'property' in MulticlassMetrics

2016-06-08 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15823:


 Summary: Add @property for 'property' in MulticlassMetrics
 Key: SPARK-15823
 URL: https://issues.apache.org/jira/browse/SPARK-15823
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: zhengruifeng
Priority: Minor


'accuracy' should be decorated with `@property` to keep in step with other methods 
in `pyspark.MulticlassMetrics`, like `weightedPrecision`, `weightedRecall`, etc.






[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score

2016-05-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305089#comment-15305089
 ] 

zhengruifeng commented on SPARK-15617:
--

I can work on this

> Clarify that fMeasure in MulticlassMetrics and 
> MulticlassClassificationEvaluator is "micro" f1_score
> 
>
> Key: SPARK-15617
> URL: https://issues.apache.org/jira/browse/SPARK-15617
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See description in sklearn docs: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html]
> I believe we are calculating the "micro" average for {{val fMeasure: 
> Double}}.  We should clarify this in the docs.
> I'm not sure if "micro" is a common term, so we should check other libraries 
> too.
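To illustrate why the docs matter: micro-averaging pools the per-class TP/FP/FN counts before computing precision and recall, and for single-label multiclass data micro-F1 collapses to plain accuracy. A standalone sketch under that assumption; `micro_f1` is a hypothetical helper, not Spark's implementation:

```python
def micro_f1(pairs):
    """Micro-averaged F1 over (prediction, label) pairs."""
    tp = sum(1 for p, l in pairs if p == l)
    # in single-label classification each mistake is simultaneously a
    # false positive for the predicted class and a false negative for
    # the true class, so the pooled FP and FN counts coincide
    fp = fn = sum(1 for p, l in pairs if p != l)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Since pooled FP equals pooled FN, micro precision, micro recall, and micro F1 are all the same number: tp / len(pairs), i.e. accuracy.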






[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score

2016-05-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305086#comment-15305086
 ] 

zhengruifeng commented on SPARK-15617:
--

Revolutions 
(http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html#micro) 
also calls it `Micro-averaged Metrics`.

> Clarify that fMeasure in MulticlassMetrics and 
> MulticlassClassificationEvaluator is "micro" f1_score
> 
>
> Key: SPARK-15617
> URL: https://issues.apache.org/jira/browse/SPARK-15617
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See description in sklearn docs: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html]
> I believe we are calculating the "micro" average for {{val fMeasure: 
> Double}}.  We should clarify this in the docs.
> I'm not sure if "micro" is a common term, so we should check other libraries 
> too.






[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-05-28 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305239#comment-15305239
 ] 

zhengruifeng commented on SPARK-15581:
--

With regard to GBT, xgboost4j may be involved.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes a long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> 

[jira] [Created] (SPARK-15939) Clarify ml.linalg usage

2016-06-14 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15939:


 Summary: Clarify ml.linalg usage
 Key: SPARK-15939
 URL: https://issues.apache.org/jira/browse/SPARK-15939
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: zhengruifeng
Priority: Trivial


1, update comments in {pyspark.ml} that it uses {ml.linalg} not {mllib.linalg}
2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py}






[jira] [Updated] (SPARK-15939) Clarify ml.linalg usage

2016-06-14 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15939:
-
Description: 
1, update comments in {{pyspark.ml}} that it uses {{ml.linalg}} not 
{{mllib.linalg}}
2, rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}}

  was:
1, update comments in {{pyspark.ml}} that it use {ml.linalg} not {mllib.linalg}
2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py}


> Clarify ml.linalg usage
> ---
>
> Key: SPARK-15939
> URL: https://issues.apache.org/jira/browse/SPARK-15939
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, update comments in {{pyspark.ml}} that it uses {{ml.linalg}} not 
> {{mllib.linalg}}
> 2, rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}}






[jira] [Updated] (SPARK-15939) Clarify ml.linalg usage

2016-06-14 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15939:
-
Description: 
1, update comments in {{pyspark.ml}} that it uses {ml.linalg} not {mllib.linalg}
2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py}

  was:
1, update comments in {pyspark.ml} that it use {ml.linalg} not {mllib.linalg}
2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py}


> Clarify ml.linalg usage
> ---
>
> Key: SPARK-15939
> URL: https://issues.apache.org/jira/browse/SPARK-15939
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1, update comments in {{pyspark.ml}} that it uses {ml.linalg} not 
> {mllib.linalg}
> 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py}






[jira] [Created] (SPARK-15650) Add correctness test for MulticlassClassification

2016-05-30 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15650:


 Summary: Add correctness test for MulticlassClassification
 Key: SPARK-15650
 URL: https://issues.apache.org/jira/browse/SPARK-15650
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Minor


{{BinaryClassificationEvaluatorSuite}} and {{RegressionEvaluatorSuite}} have 
tests for correctness checking, while 
{{MulticlassClassificationEvaluatorSuite}} does not.







[jira] [Commented] (SPARK-15614) ml.feature should support default value of input column

2016-05-30 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306419#comment-15306419
 ] 

zhengruifeng commented on SPARK-15614:
--

Agreed.
What about setting the default value of {{setInputCol}} if the algorithm takes 
features as input?

> ml.feature should support default value of input column
> ---
>
> Key: SPARK-15614
> URL: https://issues.apache.org/jira/browse/SPARK-15614
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the default 
> input column, while {{ml.feature}} uses the {{setInputCol}} method to set the 
> input column and has no default value, which is somewhat inconsistent.
> It may be nice to support the default input column "features" in 
> {{ml.feature}}: we could make these implementations extend {{HasFeaturesCol}} 
> and make the existing {{setInputCol}} method just an alias.
> I can work on this if needed.
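The proposal amounts to giving feature transformers the same default the estimators already have; a minimal Python sketch of the intended behavior (class and attribute names are hypothetical, not Spark's actual Param machinery):

```python
class FeatureTransformerSketch:
    """Hypothetical sketch of the proposal: the input column defaults to
    "features" (as in ml.classification), with setInputCol kept as a
    fluent setter/alias rather than a mandatory call."""

    def __init__(self):
        self.input_col = "features"  # default; no explicit setInputCol needed

    def setInputCol(self, name):
        self.input_col = name
        return self  # fluent, Spark ML setter style
```

Callers that already produce a "features" column could then chain transformers without any column wiring, while existing `setInputCol` calls keep working unchanged.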






[jira] [Updated] (SPARK-15650) Add correctness test for MulticlassClassificationEvaluator

2016-05-30 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15650:
-
Summary: Add correctness test for MulticlassClassificationEvaluator  (was: 
Add correctness test for MulticlassClassification)

> Add correctness test for MulticlassClassificationEvaluator
> --
>
> Key: SPARK-15650
> URL: https://issues.apache.org/jira/browse/SPARK-15650
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{BinaryClassificationEvaluatorSuite}} and {{RegressionEvaluatorSuite}} have 
> tests for correctness checking, while 
> {{MulticlassClassificationEvaluatorSuite}} does not.






[jira] [Closed] (SPARK-15291) Remove redundant codes in SVD++

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-15291.

Resolution: Won't Fix

> Remove redundant codes in SVD++
> ---
>
> Key: SPARK-15291
> URL: https://issues.apache.org/jira/browse/SPARK-15291
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> val newVertices = g.vertices.mapValues(v => (v._1.toArray, v._2.toArray, 
> v._3, v._4))
> (Graph(newVertices, g.edges), u)
> {code}
> is just the same as 
> {code}
> (g, u)
> {code}






[jira] [Closed] (SPARK-15607) Remove redundant toArray in ml.linalg

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-15607.

Resolution: Won't Fix

> Remove redundant toArray in ml.linalg
> -
>
> Key: SPARK-15607
> URL: https://issues.apache.org/jira/browse/SPARK-15607
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}}






[jira] [Updated] (SPARK-15610) update error message for k in pca

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15610:
-
Summary: update error message for k in pca  (was: PCA should not support k 
== numFeatures)

> update error message for k in pca
> -
>
> Key: SPARK-15610
> URL: https://issues.apache.org/jira/browse/SPARK-15610
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Vector size must be greater than {{k}}, but it currently supports {{k == 
> vector.size}}






[jira] [Updated] (SPARK-15610) PCA should not support k == numFeatures

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15610:
-
Priority: Minor  (was: Major)

> PCA should not support k == numFeatures
> ---
>
> Key: SPARK-15610
> URL: https://issues.apache.org/jira/browse/SPARK-15610
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Vector size must be greater than {{k}}, but it currently supports {{k == 
> vector.size}}






[jira] [Updated] (SPARK-15610) update error message for k in pca

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15610:
-
Description: error message for {{k}} should match the bound  (was: Vector 
size must be greater than {{k}}, but now it support {{k == vector.size}})

> update error message for k in pca
> -
>
> Key: SPARK-15610
> URL: https://issues.apache.org/jira/browse/SPARK-15610
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> error message for {{k}} should match the bound






[jira] [Created] (SPARK-15607) Remove redundant toArray in ml.linalg

2016-05-27 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15607:


 Summary: Remove redundant toArray in ml.linalg
 Key: SPARK-15607
 URL: https://issues.apache.org/jira/browse/SPARK-15607
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Minor


{{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}}






[jira] [Updated] (SPARK-15614) ml.feature should support default value of input column

2016-05-27 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15614:
-
Priority: Minor  (was: Major)

> ml.feature should support default value of input column
> ---
>
> Key: SPARK-15614
> URL: https://issues.apache.org/jira/browse/SPARK-15614
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the 
> default input column, while {{ml.feature}} uses the {{setInputCol}} method 
> to set the input column and has no default value, which is somewhat strange.
> It may be nice to support the default input column "features" in 
> {{ml.feature}}: we can make these implementations extend {{HasFeaturesCol}} 
> and make the existing {{setInputCol}} method just an alias.
> I can work on this if needed.






[jira] [Created] (SPARK-15614) ml.feature should support default value of input column

2016-05-27 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15614:


 Summary: ml.feature should support default value of input column
 Key: SPARK-15614
 URL: https://issues.apache.org/jira/browse/SPARK-15614
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Reporter: zhengruifeng


{{ml.classification}} and {{ml.clustering}} use {{"features"}} as the default 
input column, while {{ml.feature}} uses the {{setInputCol}} method to set the 
input column and has no default value, which is somewhat strange.
It may be nice to support the default input column "features" in 
{{ml.feature}}: we can make these implementations extend {{HasFeaturesCol}} 
and make the existing {{setInputCol}} method just an alias.
I can work on this if needed.






[jira] [Commented] (SPARK-15614) ml.feature should support default value of input column

2016-05-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303971#comment-15303971
 ] 

zhengruifeng commented on SPARK-15614:
--

[~josephkb] [~mengxr] [~yanboliang] any thoughts?

> ml.feature should support default value of input column
> ---
>
> Key: SPARK-15614
> URL: https://issues.apache.org/jira/browse/SPARK-15614
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the 
> default input column, while {{ml.feature}} uses the {{setInputCol}} method 
> to set the input column and has no default value, which is somewhat strange.
> It may be nice to support the default input column "features" in 
> {{ml.feature}}: we can make these implementations extend {{HasFeaturesCol}} 
> and make the existing {{setInputCol}} method just an alias.
> I can work on this if needed.






[jira] [Comment Edited] (SPARK-15614) ml.feature should support default value of input column

2016-05-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303971#comment-15303971
 ] 

zhengruifeng edited comment on SPARK-15614 at 5/27/16 11:54 AM:


[~josephkb] [~mengxr] [~yanboliang] [~mlnick] any thoughts?


was (Author: podongfeng):
[~josephkb] [~mengxr] [~yanboliang] any thoughts?

> ml.feature should support default value of input column
> ---
>
> Key: SPARK-15614
> URL: https://issues.apache.org/jira/browse/SPARK-15614
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the 
> default input column, while {{ml.feature}} uses the {{setInputCol}} method 
> to set the input column and has no default value, which is somewhat strange.
> It may be nice to support the default input column "features" in 
> {{ml.feature}}: we can make these implementations extend {{HasFeaturesCol}} 
> and make the existing {{setInputCol}} method just an alias.
> I can work on this if needed.






[jira] [Created] (SPARK-15610) PCA should not support k == numFeatures

2016-05-27 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-15610:


 Summary: PCA should not support k == numFeatures
 Key: SPARK-15610
 URL: https://issues.apache.org/jira/browse/SPARK-15610
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: zhengruifeng


Vector size must be greater than {{k}}, but {{k == vector.size}} is currently 
supported.






[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score

2016-06-01 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311752#comment-15311752
 ] 

zhengruifeng commented on SPARK-15617:
--

Agreed.
In {{MulticlassClassificationEvaluator}}, I will remove precision/recall but 
keep f1 (the weighted-averaged f1-measure, which is not equal to accuracy).
For {{MulticlassMetrics}}, I will just update the user guide.
Is this OK?
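For context, a minimal Python sketch (illustrative only, not Spark code) showing why the micro-averaged F1 collapses to accuracy for single-label multiclass data: every false positive for one class is simultaneously a false negative for another, so the pooled precision and recall coincide.

```python
def micro_f1(y_true, y_pred):
    # Pool true positives and errors over all classes. In single-label
    # multiclass evaluation each misclassification is both a false positive
    # (for the predicted class) and a false negative (for the true class),
    # so precision == recall == accuracy, and hence micro-F1 == accuracy.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    fn = fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = micro_f1([0, 1, 2, 2], [0, 1, 1, 2])  # same value as plain accuracy
```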

> Clarify that fMeasure in MulticlassMetrics and 
> MulticlassClassificationEvaluator is "micro" f1_score
> 
>
> Key: SPARK-15617
> URL: https://issues.apache.org/jira/browse/SPARK-15617
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See description in sklearn docs: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html]
> I believe we are calculating the "micro" average for {{val fMeasure: 
> Double}}.  We should clarify this in the docs.
> I'm not sure if "micro" is a common term, so we should check other libraries 
> too.






[jira] [Created] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics

2016-02-22 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13435:


 Summary: Add Weighted Cohen's kappa to MulticlassMetrics
 Key: SPARK-13435
 URL: https://issues.apache.org/jira/browse/SPARK-13435
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng


Add the missing Weighted Cohen's kappa to MulticlassMetrics.
Kappa is widely used in competitions and statistics.
https://en.wikipedia.org/wiki/Cohen's_kappa

Some usage examples:

val metrics = new MulticlassMetrics(predictionAndLabels)

// The default kappa value (unweighted kappa)
val kappa = metrics.kappa

// Three built-in weighting types ("default": unweighted, "linear": linearly
// weighted, "quadratic": quadratically weighted)
val kappa = metrics.kappa("quadratic")

// User-defined weighting matrix
val matrix = Matrices.dense(n, n, values)
val kappa = metrics.kappa(matrix)

// User-defined weighting function
def getWeight(i: Int, j: Int): Double = {
  if (i == j) {
    0.0
  } else {
    1.0
  }
}
val kappa = metrics.kappa(getWeight) // equal to the unweighted kappa

The calculation correctness was tested on several small datasets, and the 
results were compared with two Python packages: sklearn and ml_metrics.
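A rough, self-contained Python sketch of the weighting schemes the proposal describes (illustrative only; the Scala API above is the actual proposal, and this simply mirrors its "default"/"linear"/"quadratic" options):

```python
from collections import Counter

def weighted_kappa(y_true, y_pred, n_classes, weighting="default"):
    # kappa = 1 - sum(w * O) / sum(w * E), where O is the observed confusion
    # matrix and E is the matrix expected from the marginal label histograms.
    obs = Counter(zip(y_true, y_pred))
    hist_t, hist_p = Counter(y_true), Counter(y_pred)
    n = len(y_true)

    def w(i, j):
        if weighting == "default":            # unweighted kappa
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (n_classes - 1)      # "linear" disagreement weight
        return d if weighting == "linear" else d * d  # "quadratic"

    num = sum(w(i, j) * obs[(i, j)]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(w(i, j) * hist_t[i] * hist_p[j] / n
              for i in range(n_classes) for j in range(n_classes))
    return 1.0 - num / den
```

For the unweighted case this reduces to the familiar (p_o - p_e) / (1 - p_e) form.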









[jira] [Created] (SPARK-13506) Fix the wrong parameter in R code comment in AssociationRulesSuite

2016-02-26 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13506:


 Summary: Fix the wrong parameter in R code comment in 
AssociationRulesSuite 
 Key: SPARK-13506
 URL: https://issues.apache.org/jira/browse/SPARK-13506
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: zhengruifeng
Priority: Trivial


The following R snippet in AssociationRulesSuite is wrong:

/* Verify results using the `R` code:
   transactions = as(sapply(
 list("r z h k p",
  "z y x w v u t s",
  "s x o n r",
  "x z y m t s q e",
  "z",
  "x z y r q t p"),
 FUN=function(x) strsplit(x," ",fixed=TRUE)),
 "transactions")
   ars = apriori(transactions,
 parameter = list(support = 0.0, confidence = 0.5, 
target="rules", minlen=2))
   arsDF = as(ars, "data.frame")
   arsDF$support = arsDF$support * length(transactions)
   names(arsDF)[names(arsDF) == "support"] = "freq"
   > nrow(arsDF)
   [1] 23
   > sum(arsDF$confidence == 1)
   [1] 23
 */

The actual outputs are:
> nrow(arsDF)
[1] 441838
> sum(arsDF$confidence == 1)
[1] 441592

It turns out that the parameters passed to the apriori function were wrong.






[jira] [Created] (SPARK-13538) Add GaussianMixture to ML

2016-02-28 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13538:


 Summary: Add GaussianMixture to ML
 Key: SPARK-13538
 URL: https://issues.apache.org/jira/browse/SPARK-13538
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Minor


Add GaussianMixture and GaussianMixtureModel to ML package







[jira] [Created] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans

2016-02-29 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13550:


 Summary: Add java example for ml.clustering.BisectingKMeans
 Key: SPARK-13550
 URL: https://issues.apache.org/jira/browse/SPARK-13550
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Trivial


Add java example for ml.clustering.BisectingKMeans






[jira] [Created] (SPARK-13551) Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample

2016-02-29 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13551:


 Summary: Fix wrong comment and remove meaningless lines in 
mllib.JavaBisectingKMeansExample
 Key: SPARK-13551
 URL: https://issues.apache.org/jira/browse/SPARK-13551
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng
Priority: Trivial


This description is wrong:
/**
 * Java example for graph clustering using power iteration clustering (PIC).
 */


This for loop is meaningless:
for (Vector center: model.clusterCenters()) {
  System.out.println("");
}






[jira] [Commented] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics

2016-02-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157015#comment-15157015
 ] 

zhengruifeng commented on SPARK-13435:
--

I don't think so.
Recently, many competitions have used the quadratic weighted kappa as the 
evaluation metric, such as:
https://www.kaggle.com/c/diabetic-retinopathy-detection/details/evaluation
https://www.kaggle.com/c/prudential-life-insurance-assessment/details/evaluation
...

The unweighted kappa is very easy to compute, especially for binary 
classification, but the weighted one is not so obvious and causes much 
confusion. You can find in Kaggle's forums that many people are confused by it.

> Add Weighted Cohen's kappa to MulticlassMetrics
> ---
>
> Key: SPARK-13435
> URL: https://issues.apache.org/jira/browse/SPARK-13435
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> Add the missing Weighted Cohen's kappa to MulticlassMetrics.
> Kappa is widely used in competitions and statistics.
> https://en.wikipedia.org/wiki/Cohen's_kappa
> Some usage examples:
> val metrics = new MulticlassMetrics(predictionAndLabels)
> // The default kappa value (unweighted kappa)
> val kappa = metrics.kappa
> // Three built-in weighting types ("default": unweighted, "linear": linearly 
> // weighted, "quadratic": quadratically weighted)
> val kappa = metrics.kappa("quadratic")
> // User-defined weighting matrix
> val matrix = Matrices.dense(n, n, values)
> val kappa = metrics.kappa(matrix)
> // User-defined weighting function
> def getWeight(i: Int, j: Int): Double = {
>   if (i == j) {
>     0.0
>   } else {
>     1.0
>   }
> }
> val kappa = metrics.kappa(getWeight) // equal to the unweighted kappa
> The calculation correctness was tested on several small datasets, and the 
> results were compared with two Python packages: sklearn and ml_metrics.






[jira] [Created] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths

2016-02-18 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13385:


 Summary: Enable AssociationRules to generate consequents with 
user-defined lengths
 Key: SPARK-13385
 URL: https://issues.apache.org/jira/browse/SPARK-13385
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.0
Reporter: zhengruifeng


AssociationRules should generate all association rules up to a user-defined 
number of iterations, not just rules that have a single item as the consequent.

Such as:
39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
...


I have implemented it based on the Apriori rule-generation algorithm:
https://github.com/zhengruifeng/spark-rules

It's compatible with fpm's APIs.

import org.apache.spark.mllib.fpm._

val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ')).persist()

val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)

val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15)
val results = ar.run(model.freqItemsets)

and it outputs rule-generation information like this:
15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 312917
15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 707747
15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 1020253
15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 972225
15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 653749
15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 331038
15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 138490
15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0
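An illustrative Python sketch (not the Spark implementation) of how the confidence of each generated rule follows from the frequent-itemset supports: conf(A => B) = support(A ∪ B) / support(A). The antecedent support 1459 below is hypothetical, back-computed from the quoted confidence 0.70117 of the rule "39 804 ==> 413 743 819" purely for illustration.

```python
def rule_confidence(freq, antecedent, consequent):
    # Confidence of the association rule A => B, given a map from
    # frequent itemset to its support count.
    whole = frozenset(antecedent) | frozenset(consequent)
    return freq[whole] / freq[frozenset(antecedent)]

# Hypothetical support counts (see lead-in); only their ratio matters here.
freq = {
    frozenset({"39", "804"}): 1459,
    frozenset({"39", "413", "743", "804", "819"}): 1023,
}
conf = rule_confidence(freq, {"39", "804"}, {"413", "743", "819"})
```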







[jira] [Updated] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths

2016-02-18 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13385:
-
Attachment: rule-generation.pdf

rule-generation algorithm

> Enable AssociationRules to generate consequents with user-defined lengths
> -
>
> Key: SPARK-13385
> URL: https://issues.apache.org/jira/browse/SPARK-13385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: zhengruifeng
> Attachments: rule-generation.pdf
>
>
> AssociationRules should generate all association rules up to a user-defined 
> number of iterations, not just rules that have a single item as the 
> consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...
> I have implemented it based on the Apriori rule-generation algorithm:
> https://github.com/zhengruifeng/spark-rules
> It's compatible with fpm's APIs.
> import org.apache.spark.mllib.fpm._
> val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
> val transactions = data.map(s => s.trim.split(' ')).persist()
> val fpg = new FPGrowth().setMinSupport(0.01)
> val model = fpg.run(transactions)
> val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15)
> val results = ar.run(model.freqItemsets)
> and it outputs rule-generation information like this:
> 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 
> 312917
> 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
> 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 
> 707747
> 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
> 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 
> 1020253
> 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
> 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 
> 972225
> 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
> 15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 
> 653749
> 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
> 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 
> 331038
> 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
> 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 
> 138490
> 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
> 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
> 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
> 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
> 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
> 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
> 15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
> 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
> 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
> 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
> 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
> 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0






[jira] [Created] (SPARK-13386) ConnectedComponents should support maxIteration option

2016-02-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13386:


 Summary: ConnectedComponents should support maxIteration option
 Key: SPARK-13386
 URL: https://issues.apache.org/jira/browse/SPARK-13386
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: zhengruifeng


Running ConnectedComponents is time-consuming on big, complex graphs.
I use it on a graph with 1.7B vertices and 11B edges, where an exact result is 
not a must. So I think users should be able to directly control the 
maxIteration of this algorithm.
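A rough single-machine sketch of the idea, assuming an iterative minimum-label propagation like GraphX's Pregel-based implementation; capping the number of iterations trades exactness for bounded runtime:

```python
def connected_components(edges, n, max_iterations):
    # Each vertex starts labelled with its own id; each iteration propagates
    # the minimum label across every edge. Stopping early via max_iterations
    # yields an approximate labelling when the graph has long paths.
    comp = list(range(n))
    for _ in range(max_iterations):
        changed = False
        for u, v in edges:
            m = min(comp[u], comp[v])
            if comp[u] != m or comp[v] != m:
                comp[u] = comp[v] = m
                changed = True
        if not changed:
            break
    return comp
```

With a generous cap the result is exact; with a tight cap, labels may not yet have propagated across long chains, which is the approximate behaviour the issue is willing to accept.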






[jira] [Created] (SPARK-13416) Add positive check for option 'numIter' in StronglyConnectedComponents

2016-02-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13416:


 Summary: Add positive check for option 'numIter' in 
StronglyConnectedComponents 
 Key: SPARK-13416
 URL: https://issues.apache.org/jira/browse/SPARK-13416
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: zhengruifeng
Priority: Minor


The output of StronglyConnectedComponents with numIter no greater than 1 may 
make no sense, so I just added a require check to it.






[jira] [Created] (SPARK-13814) Delete unnecessary imports in python examples files

2016-03-10 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13814:


 Summary: Delete unnecessary imports in python examples files
 Key: SPARK-13814
 URL: https://issues.apache.org/jira/browse/SPARK-13814
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: zhengruifeng
Priority: Trivial


Delete unnecessary imports in Python example files






[jira] [Created] (SPARK-13816) Add parameter checks for algorithms in Graphx

2016-03-11 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13816:


 Summary: Add parameter checks for algorithms in Graphx 
 Key: SPARK-13816
 URL: https://issues.apache.org/jira/browse/SPARK-13816
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: zhengruifeng
Priority: Trivial


Add parameter checks to GraphX algorithms:

maxIterations in Pregel
maxSteps in LabelPropagation
numIter, resetProb, tol in PageRank
maxIters, maxVal, minVal in SVDPlusPlus








[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection

2016-03-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202658#comment-15202658
 ] 

zhengruifeng commented on SPARK-14005:
--

I think ease of implementation should not be a reason to ignore convenience.

> Make RDD more compatible with Scala's collection 
> -
>
> Key: SPARK-14005
> URL: https://issues.apache.org/jira/browse/SPARK-14005
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Reporter: zhengruifeng
>Priority: Trivial
>
> How about implementing some more methods for RDD to make it more compatible 
> with Scala's collection?
> Such as:
> nonEmpty, slice, takeRight, contains, last, reverse






[jira] [Created] (SPARK-13970) Add Non-Negative Matrix Factorization to MLlib

2016-03-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13970:


 Summary: Add Non-Negative Matrix Factorization to MLlib
 Key: SPARK-13970
 URL: https://issues.apache.org/jira/browse/SPARK-13970
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhengruifeng
Priority: Minor


NMF finds two non-negative matrices (W, H) whose product W * H.T approximates 
the non-negative matrix X. This factorization can be used, for example, for 
dimensionality reduction, source separation, or topic extraction.

NMF is implemented in several packages:
Scikit-Learn 
(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
LibNMF (http://www.univie.ac.at/rlcta/software/)

I have implemented it in MLlib according to the following papers:
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis 
on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
Algorithms for Non-negative Matrix Factorization 
(http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)

It can be used like this:

val data = Seq(
  (0L, Vectors.dense(0.0, 1.0, 2.0)),
  (1L, Vectors.dense(3.0, 4.0, 5.0)),
  (3L, Vectors.dense(9.0, 0.0, 1.0))
).map(x => IndexedRow(x._1, x._2))

val A = new IndexedRowMatrix(sc.parallelize(data)).toCoordinateMatrix()
val k = 2

// run the NMF algorithm
val r = NMF.solve(A, k, 10)

val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
1.1349295096806706   1.4423101890626953E-5
3.453054133110303    0.46312492493865615
0.0                  0.0
0.3133764134585149   2.70684017255672

val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4184163313845057   3.2719352525149286
1.12188012613645     0.002939823716977737
1.456499371939653    0.18992996116069297

val R = rW.multiply(rH.transpose)
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4749202332761286   1.273254903877907    1.6530268574248572
2.9601290106732367   3.8752743120480346   5.117332475154927
0.0                  0.0                  0.0
8.987727592773672    0.35952840319637736  0.9705425982249293

val AD = A.toBlockMatrix().toLocalMatrix()
>>> org.apache.spark.mllib.linalg.Matrix =
0.0  1.0  2.0
3.0  4.0  5.0
0.0  0.0  0.0
9.0  0.0  1.0

var loss = 0.0
for (i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
  val diff = AD(i, j) - R(i, j)
  loss += diff * diff
}
loss
>>> Double = 0.5817999580912183
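For reference, a tiny pure-Python sketch of the multiplicative updates from the Lee & Seung paper cited above, run on the same matrix (illustrative only; this is not the proposed MLlib implementation, and the loss it reaches will differ slightly from the value quoted above):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(X, k, iters=500, seed=1, eps=1e-9):
    # Lee & Seung multiplicative updates minimizing ||X - W H||_F^2.
    # Non-negativity is preserved because each update multiplies an entry
    # by a ratio of non-negative quantities.
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WtX = matmul(transpose(W), X)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtX[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(k)]
        XHt = matmul(X, transpose(H))
        WHHt = matmul(matmul(W, H), transpose(H))
        W = [[W[i][j] * XHt[i][j] / (WHHt[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [0.0, 0.0, 0.0], [9.0, 0.0, 1.0]]
W, H = nmf(X, k=2)
R = matmul(W, H)
loss = sum((X[i][j] - R[i][j]) ** 2 for i in range(4) for j in range(3))
```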








[jira] [Created] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-03-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14022:


 Summary: What about adding RandomProjection to ML/MLLIB as a new 
dimensionality reduction algorithm?
 Key: SPARK-14022
 URL: https://issues.apache.org/jira/browse/SPARK-14022
 Project: Spark
  Issue Type: Question
Reporter: zhengruifeng
Priority: Minor


What about adding RandomProjection to ML/MLLIB as a new dimensionality 
reduction algorithm?
RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces the 
dimensionality by projecting the original input space onto a randomly 
generated matrix.
It is fully scalable and runs fast (perhaps the fastest of these methods).
It is implemented in sklearn 
(http://scikit-learn.org/stable/modules/random_projection.html).
I am willing to do this, if needed.
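A minimal pure-Python sketch of the core idea, assuming a dense Gaussian projection matrix (sklearn additionally offers a sparse variant):

```python
import math
import random

def random_projection(X, target_dim, seed=0):
    # Project the rows of X onto a random matrix with entries drawn from
    # N(0, 1/target_dim); by the Johnson-Lindenstrauss lemma this
    # approximately preserves pairwise distances with high probability.
    rng = random.Random(seed)
    d = len(X[0])
    sigma = 1.0 / math.sqrt(target_dim)
    R = [[rng.gauss(0.0, sigma) for _ in range(target_dim)] for _ in range(d)]
    return [[sum(row[i] * R[i][j] for i in range(d))
             for j in range(target_dim)] for row in X]

# Reduce 5 points from 100 dimensions down to 10.
Y = random_projection([[1.0] * 100 for _ in range(5)], target_dim=10)
```

Because the projection matrix is data-independent, the method parallelizes trivially, which is why it scales so well.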






[jira] [Issue Comment Deleted] (SPARK-13712) Add OneVsOne to ML

2016-03-14 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13712:
-
Comment: was deleted

(was: OK, I have closed the PR.
I had also planned to implement ECC after this PR.
In general, OneVsOne is slowest among the three methods, but it generate the 
highest accuracy. ECC is the fastest one (about log(num_class) submodels) with 
lowest accuracy. OneVsRest is in middle of them, both speed and accuracy.
In most case, num_class is a small number, and so OneVsOne is useful.
Suppose there are 3 classes, OneVsOne is even faster than OneVsRest. So I think 
it may be a useful choice for user.)

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Another meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne method will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (2/K of the total).
> OneVsOne is implemented in the same way OneVsRest is:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)
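A small Python sketch of the meta-strategy (illustrative; the function names are hypothetical, not the proposed API): K*(K-1)/2 pairwise classifiers, each trained only on the two classes' data, with prediction by majority vote.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_pairs(n_classes):
    # One binary problem per unordered class pair: K*(K-1)/2 in total.
    return list(combinations(range(n_classes), 2))

def ovo_predict(pairwise_votes):
    # Each pairwise classifier votes for one of its two classes;
    # the class with the most votes wins.
    return Counter(pairwise_votes).most_common(1)[0][0]

pairs = one_vs_one_pairs(4)            # 6 classifiers for 4 classes
pred = ovo_predict([1, 2, 1, 1, 0, 3])
```

This also shows why each submodel is cheap: a pair's training set contains only the examples of its two classes, roughly 2/K of the data.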






[jira] [Commented] (SPARK-13712) Add OneVsOne to ML

2016-03-14 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194554#comment-15194554
 ] 

zhengruifeng commented on SPARK-13712:
--

OK, I have closed the PR.
I had also planned to implement ECC after this PR.
In general, OneVsOne is the slowest of the three methods, but it yields the 
highest accuracy. ECC is the fastest (about log(num_class) submodels) but has 
the lowest accuracy. OneVsRest falls between them in both speed and accuracy.
In most cases, num_class is a small number, so OneVsOne is useful.
With 3 classes, for example, OneVsOne trains the same number of models as 
OneVsRest but on smaller subsets, so it is even faster. So I think it may be a 
useful choice for users.

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Another Meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (2/K of the total).
> OneVsOne is implemented the same way OneVsRest is:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)






[jira] [Commented] (SPARK-13712) Add OneVsOne to ML

2016-03-14 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194555#comment-15194555
 ] 

zhengruifeng commented on SPARK-13712:
--

OK, I have closed the PR.
I had also planned to implement ECC after this PR.
In general, OneVsOne is the slowest of the three methods, but it yields the 
highest accuracy. ECC is the fastest (about log(num_class) submodels) but has 
the lowest accuracy. OneVsRest falls between them in both speed and accuracy.
In most cases, num_class is a small number, so OneVsOne is useful.
With 3 classes, for example, OneVsOne trains the same number of models as 
OneVsRest but on smaller subsets, so it is even faster. So I think it may be a 
useful choice for users.

> Add OneVsOne to ML
> --
>
> Key: SPARK-13712
> URL: https://issues.apache.org/jira/browse/SPARK-13712
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Another Meta method for multi-class classification.
> Most classification algorithms were designed for balanced data.
> The OneVsRest method will generate K models on imbalanced data.
> The OneVsOne will train K*(K-1)/2 models on balanced data.
> OneVsOne is less sensitive to the problems of imbalanced datasets, and can 
> usually result in higher precision.
> But it is much more computationally expensive, although each model is 
> trained on a much smaller dataset (2/K of the total).
> OneVsOne is implemented the same way OneVsRest is:
> val classifier = new LogisticRegression()
> val ovo = new OneVsOne()
> ovo.setClassifier(classifier)
> val ovoModel = ovo.fit(data)
> val predictions = ovoModel.transform(data)






[jira] [Commented] (SPARK-14516) Clustering evaluator

2016-04-13 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239140#comment-15239140
 ] 

zhengruifeng commented on SPARK-14516:
--

OK, I will work on clarifying this API.

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.






[jira] [Comment Edited] (SPARK-14516) Clustering evaluator

2016-04-13 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239140#comment-15239140
 ] 

zhengruifeng edited comment on SPARK-14516 at 4/13/16 12:22 PM:


OK, I will clarify this API.


was (Author: podongfeng):
ok, I will work on clarify this API.

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.






[jira] [Created] (SPARK-14510) Add args-checking for LDA and StreamingKMeans

2016-04-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14510:


 Summary: Add args-checking for LDA and StreamingKMeans
 Key: SPARK-14510
 URL: https://issues.apache.org/jira/browse/SPARK-14510
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng


Add args-checking for LDA and StreamingKMeans






[jira] [Created] (SPARK-14509) Add python CountVectorizerExample

2016-04-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14509:


 Summary: Add python CountVectorizerExample
 Key: SPARK-14509
 URL: https://issues.apache.org/jira/browse/SPARK-14509
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: zhengruifeng
Priority: Minor


Add the missing python example for CountVectorizer






[jira] [Updated] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14022:
-
Issue Type: Brainstorming  (was: Question)

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space on a randomly 
> generated matrix. 
> It is fully scalable, and runs fast (maybe fastest).
> It was implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.






[jira] [Reopened] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reopened SPARK-14022:
--

There may need to be some discussion on whether to add RandomProjection or not.

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space on a randomly 
> generated matrix. 
> It is fully scalable, and runs fast (maybe fastest).
> It was implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.






[jira] [Created] (SPARK-14514) Add python example for VectorSlicer

2016-04-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14514:


 Summary: Add python example for VectorSlicer
 Key: SPARK-14514
 URL: https://issues.apache.org/jira/browse/SPARK-14514
 Project: Spark
  Issue Type: Improvement
Reporter: zhengruifeng


Add the missing python example for VectorSlicer






[jira] [Created] (SPARK-14515) Add python example for ChiSqSelector

2016-04-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14515:


 Summary: Add python example for ChiSqSelector
 Key: SPARK-14515
 URL: https://issues.apache.org/jira/browse/SPARK-14515
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: zhengruifeng


Add the missing python example for ChiSqSelector






[jira] [Updated] (SPARK-14514) Add python example for VectorSlicer

2016-04-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14514:
-
Component/s: Documentation

> Add python example for VectorSlicer
> ---
>
> Key: SPARK-14514
> URL: https://issues.apache.org/jira/browse/SPARK-14514
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>
> Add the missing python example for VectorSlicer






[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233937#comment-15233937
 ] 

zhengruifeng commented on SPARK-14022:
--

OK, I changed the issue type from Question to Brainstorming.
I reopened this JIRA because I think it may be nice to add the RandomProjection 
algorithm.

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space on a randomly 
> generated matrix. 
> It is fully scalable, and runs fast (maybe fastest).
> It was implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.






[jira] [Updated] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths

2016-04-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13385:
-
Priority: Major  (was: Minor)

> Enable AssociationRules to generate consequents with user-defined lengths
> -
>
> Key: SPARK-13385
> URL: https://issues.apache.org/jira/browse/SPARK-13385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Attachments: rule-generation.pdf
>
>
> AssociationRules should generate all association rules with user-defined 
> iterations, not just rules that have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...
> I have implemented it based on Apriori's Rule-Generation Algorithm:
> https://github.com/zhengruifeng/spark-rules
> It's compatible with fpm's APIs.
> import org.apache.spark.mllib.fpm._
> val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
> val transactions = data.map(s => s.trim.split(' ')).persist()
> val fpg = new FPGrowth().setMinSupport(0.01)
> val model = fpg.run(transactions)
> val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15)
> val results = ar.run(model.freqItemsets)
> and it outputs rule-generation information like this:
> 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 
> 312917
> 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
> 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 
> 707747
> 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
> 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 
> 1020253
> 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
> 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 
> 972225
> 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
> 15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 
> 653749
> 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
> 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 
> 331038
> 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
> 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 
> 138490
> 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
> 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
> 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
> 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
> 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
> 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
> 15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
> 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
> 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
> 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
> 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
> 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0
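For readers unfamiliar with the rule-generation step, here is a minimal pure-Python sketch of confidence-based rule generation over precomputed frequent itemsets. It illustrates the idea only and omits the level-wise candidate pruning that the Apriori-based implementation above performs; all names and the toy itemset data are hypothetical:

```python
from itertools import combinations

def gen_rules(freq, min_conf, max_consequent):
    """freq maps frozenset(itemset) -> support count (itemsets closed
    under subsets). Emits (antecedent, consequent, confidence) triples
    for consequents of size 1..max_consequent meeting min_conf."""
    rules = []
    for itemset, sup in freq.items():
        # the antecedent must stay non-empty, hence len(itemset) - 1
        for k in range(1, min(max_consequent, len(itemset) - 1) + 1):
            for cons in combinations(sorted(itemset), k):
                ante = itemset - frozenset(cons)
                conf = sup / freq[ante]  # support(X∪Y) / support(X)
                if conf >= min_conf:
                    rules.append((ante, frozenset(cons), conf))
    return rules

freq = {frozenset('a'): 4, frozenset('b'): 3, frozenset('c'): 2,
        frozenset('ab'): 3, frozenset('ac'): 2, frozenset('bc'): 2,
        frozenset('abc'): 2}
rules = gen_rules(freq, min_conf=0.7, max_consequent=2)
print(len(rules))  # prints 7
```

The real implementation works level by level (1-consequent candidates, then 2-consequent, and so on), which is what the log output above reflects.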






[jira] [Created] (SPARK-14512) Add python example for QuantileDiscretizer

2016-04-09 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14512:


 Summary: Add python example for QuantileDiscretizer
 Key: SPARK-14512
 URL: https://issues.apache.org/jira/browse/SPARK-14512
 Project: Spark
  Issue Type: Improvement
Reporter: zhengruifeng


Add the missing python example for QuantileDiscretizer






[jira] [Created] (SPARK-14516) What about adding general clustering metrics?

2016-04-10 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14516:


 Summary: What about adding general clustering metrics?
 Key: SPARK-14516
 URL: https://issues.apache.org/jira/browse/SPARK-14516
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: zhengruifeng


ML/MLlib don't have any general-purpose clustering metrics with a ground truth.
In 
[Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
 there are several kinds of metrics for this.
It may be meaningful to add some clustering metrics to ML/MLlib.
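As one example of the kind of ground-truth metric proposed here, a purity score can be sketched in a few lines of Python. Scikit-learn's clustering-metrics module offers richer choices such as ARI and NMI; this toy function is only an illustration:

```python
from collections import Counter

def purity(true_labels, pred_clusters):
    """Purity: give each predicted cluster its majority true label and
    measure the resulting accuracy (1.0 = every cluster is pure)."""
    clusters = {}
    for t, p in zip(true_labels, pred_clusters):
        clusters.setdefault(p, []).append(t)
    hits = sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values())
    return hits / len(true_labels)

# One point of class 2 was clustered with class 1 -> purity 5/6.
print(purity([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```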









[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233941#comment-15233941
 ] 

zhengruifeng commented on SPARK-14022:
--

cc [~yanboliang] [~mengxr] [~josephkb]

> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> ---
>
> Key: SPARK-14022
> URL: https://issues.apache.org/jira/browse/SPARK-14022
> Project: Spark
>  Issue Type: Brainstorming
>Reporter: zhengruifeng
>Priority: Minor
>
> What about adding RandomProjection to ML/MLLIB as a new dimensionality 
> reduction algorithm?
> RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces 
> the dimensionality by projecting the original input space on a randomly 
> generated matrix. 
> It is fully scalable, and runs fast (maybe fastest).
> It was implemented in sklearn 
> (http://scikit-learn.org/stable/modules/random_projection.html)
> I am willing to do this, if needed.






[jira] [Commented] (SPARK-14516) What about adding general clustering metrics?

2016-04-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233946#comment-15233946
 ] 

zhengruifeng commented on SPARK-14516:
--

cc [~mengxr] [~josephkb] [~yanboliang]

> What about adding general clustering metrics?
> -
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, MLlib
>Reporter: zhengruifeng
>
> ML/MLlib don't have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to ML/MLlib.






[jira] [Created] (SPARK-14027) Add parameter check to GradientDescent

2016-03-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14027:


 Summary: Add parameter check to GradientDescent
 Key: SPARK-14027
 URL: https://issues.apache.org/jira/browse/SPARK-14027
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng
Priority: Minor


The following code should throw an exception, not just run successfully and 
return a model:

val data = MLUtils.loadLibSVMFile(sc, "/tmp/sample_libsvm_data.txt")
val model = LogisticRegressionWithSGD.train(data, -2, -0.01, 0.5)
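A sketch of the kind of fail-fast checking proposed, in plain Python. The function name and messages are hypothetical; the actual fix would live in GradientDescent's parameter setters:

```python
def validate_sgd_params(num_iterations, step_size, mini_batch_fraction):
    """Reject nonsensical arguments up front instead of silently fitting."""
    if num_iterations <= 0:
        raise ValueError(f"num_iterations must be > 0, got {num_iterations}")
    if step_size <= 0:
        raise ValueError(f"step_size must be > 0, got {step_size}")
    if not 0 < mini_batch_fraction <= 1:
        raise ValueError(
            f"mini_batch_fraction must be in (0, 1], got {mini_batch_fraction}")

validate_sgd_params(100, 0.01, 0.5)      # valid arguments pass silently
try:
    validate_sgd_params(-2, -0.01, 0.5)  # mirrors the bad call above
except ValueError as e:
    print("rejected:", e)
```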








[jira] [Created] (SPARK-14030) Add parameter check to LBFGS

2016-03-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14030:


 Summary: Add parameter check to LBFGS
 Key: SPARK-14030
 URL: https://issues.apache.org/jira/browse/SPARK-14030
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng
Priority: Trivial


Add the missing parameter verification in LBFGS






[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection

2016-03-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203056#comment-15203056
 ] 

zhengruifeng commented on SPARK-14005:
--

OK, please close this JIRA.

> Make RDD more compatible with Scala's collection 
> -
>
> Key: SPARK-14005
> URL: https://issues.apache.org/jira/browse/SPARK-14005
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Reporter: zhengruifeng
>Priority: Trivial
>
> How about implementing some more methods for RDD to make it more compatible 
> with Scala's collections?
> Such as:
> nonEmpty, slice, takeRight, contains, last, reverse






[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2016-03-25 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212776#comment-15212776
 ] 

zhengruifeng commented on SPARK-14174:
--

There is another sklean example for MiniBatch KMeans:

http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> The above three trainings all reached the max number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speed performance 
> differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)






[jira] [Created] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2016-03-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14174:


 Summary: Accelerate KMeans via Mini-Batch EM
 Key: SPARK-14174
 URL: https://issues.apache.org/jira/browse/SPARK-14174
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng
Priority: Minor


The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
mini-batches to reduce the computation time, while still attempting to optimise 
the same objective function. Mini-batches are subsets of the input data, 
randomly sampled in each training iteration. These mini-batches drastically 
reduce the amount of computation required to converge to a local solution. In 
contrast to other algorithms that reduce the convergence time of k-means, 
mini-batch k-means produces results that are generally only slightly worse than 
the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really 
significant.
The MiniBatch KMeans is named XMeans in the following lines.

val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8

The above three trainings all reached the max number of iterations (10).
We can see that the WSSSEs are almost the same, while their speed performance 
differs significantly:
KMeans: 2876 sec
MiniBatch KMeans (fraction=0.1): 263 sec
MiniBatch KMeans (fraction=0.01): 90 sec

With an appropriate fraction, the bigger the dataset, the higher the speedup.

The data used above has 8,100,000 samples and 784 features. It can be downloaded 
here 
(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
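The mini-batch update itself can be sketched in a few lines of dependency-free Python, using per-center learning rates in the style of Sculley's web-scale k-means. This illustrates the idea only, not the MLlib implementation, and all names are hypothetical:

```python
import random

def mini_batch_kmeans(X, k, iterations, fraction, seed=123):
    """Mini-batch k-means sketch: each iteration samples a fraction of
    the data and nudges the nearest center toward each sampled point."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]
    counts = [0] * k
    for _ in range(iterations):
        batch = rng.sample(X, max(1, int(fraction * len(X))))
        for x in batch:
            # index of the nearest center by squared Euclidean distance
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(x, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate decays over time
            centers[j] = [(1 - eta) * a + eta * b
                          for a, b in zip(centers[j], x)]
    return centers

# Two well-separated clusters around (0.1, 0.1) and (5.0, 5.0).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centers = mini_batch_kmeans(X, k=2, iterations=50, fraction=0.5)
```

The WSSSE closeness reported above is expected: each center converges to a running average of the points assigned to it, which approximates the full-batch cluster mean.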






[jira] [Created] (SPARK-14005) Make RDD more compatible with Scala's collection

2016-03-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14005:


 Summary: Make RDD more compatible with Scala's collection 
 Key: SPARK-14005
 URL: https://issues.apache.org/jira/browse/SPARK-14005
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Reporter: zhengruifeng
Priority: Trivial


How about implementing some more methods for RDD to make it more compatible 
with Scala's collections?
Such as:
nonEmpty, slice, takeRight, contains, last, reverse







[jira] [Created] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13677:


 Summary: Support Tree-Based Feature Transformation for mllib
 Key: SPARK-13677
 URL: https://issues.apache.org/jira/browse/SPARK-13677
 Project: Spark
  Issue Type: New Feature
Reporter: zhengruifeng
Priority: Minor


It would be nice to be able to use RF and GBT for feature transformation:
First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by 
Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
implemented in two famous libraries:
sklearn 
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
xgboost 
(https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implemented it in MLlib:

val features : RDD[Vector] = ...
val model1 : RandomForestModel = ...
val transformed1 : RDD[Vector] = model1.leaf(features)

val model2 : GradientBoostedTreesModel = ...
val transformed2 : RDD[Vector] = model2.leaf(features)
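The leaf-index one-hot encoding can be illustrated with a toy, dependency-free Python sketch. The nested-dict tree format here is hypothetical; real RF/GBT models expose analogous leaf indices:

```python
def leaf_index(tree, x):
    """Walk a toy decision tree (nested dicts) down to a leaf and
    return that leaf's fixed index within the tree."""
    node = tree
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def one_hot_leaves(trees, x, leaves_per_tree):
    """Concatenate one-hot encodings of the leaf each tree routes x to,
    producing the transformed feature vector."""
    vec = []
    for tree, n in zip(trees, leaves_per_tree):
        hot = [0] * n
        hot[leaf_index(tree, x)] = 1
        vec.extend(hot)
    return vec

# A 2-leaf stump and a 3-leaf tree standing in for an ensemble.
t1 = {"feature": 0, "threshold": 0.5,
      "left": {"leaf": 0}, "right": {"leaf": 1}}
t2 = {"feature": 1, "threshold": 1.0,
      "left": {"leaf": 0},
      "right": {"feature": 0, "threshold": 2.0,
                "left": {"leaf": 1}, "right": {"leaf": 2}}}
print(one_hot_leaves([t1, t2], [1.5, 3.0], [2, 3]))  # [0, 1, 0, 1, 0]
```

The new feature space has one dimension per leaf across the whole ensemble, exactly as the facebook paper describes.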








[jira] [Created] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13672:


 Summary: Add python examples of BisectingKMeans in ML and MLLIB
 Key: SPARK-13672
 URL: https://issues.apache.org/jira/browse/SPARK-13672
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: zhengruifeng
Priority: Trivial


Add the missing Python examples of BisectingKMeans for ML and MLlib.






[jira] [Created] (SPARK-13714) Another ConnectedComponents based on Max-Degree Propagation

2016-03-07 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13714:


 Summary: Another ConnectedComponents based on Max-Degree 
Propagation
 Key: SPARK-13714
 URL: https://issues.apache.org/jira/browse/SPARK-13714
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: zhengruifeng
Priority: Minor


The current ConnectedComponents algorithm is based on Min-VertexId propagation, 
which is sensitive to the placement of the vertex with the minimum ID.
This implementation is based on Max-Degree propagation instead.
First, the degree graph is computed. Then, in the Pregel process, the vertex 
with the maximum degree in each CC is the starting point of propagation.
This new method has advantages over the old one:
1, Convergence is determined only by the structure of each CC, and is robust to 
the placement of the vertex with the minimum ID.
2, For spherical CCs, which may have something like a 'center', it can 
accelerate convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), 
the old CC needs 4 supersteps, while the new one needs only 2.
3, If we limit the number of iterations, the new method tends to produce more 
acceptable results.
4, The output for each CC is its vertex with the maximum degree, which may be 
more meaningful. Because vertex IDs are nominal in most cases, the vertex with 
the minimum ID in a CC is somewhat meaningless.

But there are still two disadvantages:
1, The message body grows from (VID) to (VID, Degree), that is, 
(Long) -> (Long, Int).
2, For graphs with simple CCs, it may be slower than the old one, because it 
needs an extra degree computation.

The API is the same as ConnectedComponents:

val graph = ...
val cc = graph.ConnectedComponentsWithDegree(100)
or
val cc = ConnectedComponentsWithDegree.run(graph, 100)
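The propagation idea can be sketched in plain Python as a sequential stand-in for the Pregel supersteps (names and data layout are illustrative, not the GraphX API). Each vertex keeps the best (degree, id) pair it has seen and repeatedly adopts the maximum among its neighbors, so every component converges to the label of its max-degree vertex:

```python
def cc_by_max_degree(edges):
    """Label every vertex with its component's max-degree vertex.

    edges: iterable of undirected (u, v) pairs.
    Returns {vertex: representative_vertex}.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    # Each vertex initially nominates itself; ties broken by vertex id.
    label = {v: (degree[v], v) for v in adj}
    changed = True
    while changed:                      # one pass per "superstep"
        changed = False
        for v in adj:
            best = max([label[v]] + [label[n] for n in adj[v]])
            if best != label[v]:
                label[v] = best
                changed = True
    return {v: lbl[1] for v, lbl in label.items()}
```

On edges [(1, 2), (2, 3), (2, 4), (5, 6)], the star component is labeled by its hub 2 (degree 3), and the isolated edge by vertex 6 (tie on degree, broken by id).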







[jira] [Updated] (SPARK-13714) Another ConnectedComponents based on Max-Degree Propagation

2016-03-07 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13714:
-
Description: 
The current ConnectedComponents algorithm is based on Min-VertexId propagation, 
which is sensitive to the placement of the vertex with the minimum ID.
This implementation is based on Max-Degree propagation instead.
First, the degree graph is computed. Then, in the Pregel process, the vertex 
with the maximum degree in each CC is the starting point of propagation.
This new method has advantages over the old one:
1, Convergence is determined only by the structure of each CC, and is robust to 
the placement of the vertex with the minimum ID.
2, For spherical CCs, which may have something like a 'center', it can 
accelerate convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), 
the old CC needs 4 supersteps, while the new one needs only 2.
3, If we limit the number of iterations, the new method tends to produce more 
acceptable results.
4, The output for each CC is its vertex with the maximum degree, which may be 
more meaningful. Because vertex IDs are nominal in most cases, the vertex with 
the minimum ID in a CC is somewhat meaningless.

But there are still two disadvantages:
1, The message body grows from (VID) to (VID, Degree), that is, 
(Long) -> (Long, Int).
2, For graphs with simple CCs, it may be slower than the old one, because it 
needs an extra degree computation.

The API is the same as ConnectedComponents:

val graph = ...
val cc = graph.ConnectedComponentsWithDegree(100)
or
val cc = ConnectedComponentsWithDegree.run(graph, 100)


  was:
Current ConnectedComponents algorithm was based on Min-VertexId Propagation, 
which is sensitive to the place of Min-VertexId.
While this implementation is based on Max-Degree Propagation.
First, the degree graph is computed. And in the pregel progress, the vertex 
with the max degree in a CC is the start point of propagation.
This new method has advantages over the old one:
1, The convergence is only determined by the structs of CC, and is robust to 
the place of vertex with Min-ID.
2, For spherical CCs in which there may be a concept like 'center', it can 
accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), 
the old CC need 4 supersteps, while the new one only need 2 supersteps.
3, If we limit the number of iteration, the new method tend to generate more 
acceptable results.
4, The output for each CC is the vertex with max degree in it, which may be 
more meaningful. And because the vertex-ID is nominal in most cases, the vertex 
with min-ID in a CC is somewhat meanless.

But there are still two disadvantages:
1,The message boy grows, from (VID) to (VID, Degree). that is (Long) -> (Long, 
Int)
2,For graph with simple CCs, it may be slower than old one. Because it need a 
extra degree computation.

The api is the same like ConnectedComponents:

val graph = ...
val cc = graph.ConnectedComponentsWithDegree(100)
or
val cc = ConnectedComponentsWithDegree.run(graph, 100)



> Another ConnectedComponents based on Max-Degree Propagation
> ---
>
> Key: SPARK-13714
> URL: https://issues.apache.org/jira/browse/SPARK-13714
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: zhengruifeng
>Priority: Minor
>
> The current ConnectedComponents algorithm is based on Min-VertexId 
> propagation, which is sensitive to the placement of the vertex with the 
> minimum ID. This implementation is based on Max-Degree propagation instead.
> First, the degree graph is computed. Then, in the Pregel process, the vertex 
> with the maximum degree in each CC is the starting point of propagation.
> This new method has advantages over the old one:
> 1, Convergence is determined only by the structure of each CC, and is robust 
> to the placement of the vertex with the minimum ID.
> 2, For spherical CCs, which may have something like a 'center', it can 
> accelerate convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), 
> the old CC needs 4 supersteps, while the new one needs only 2.
> 3, If we limit the number of iterations, the new method tends to produce 
> more acceptable results.
> 4, The output for each CC is its vertex with the maximum degree, which may 
> be more meaningful. Because vertex IDs are nominal in most cases, the vertex 
> with the minimum ID in a CC is somewhat meaningless.
> But there are still two disadvantages:
> 1, The message body grows from (VID) to (VID, Degree), that is, 
> (Long) -> (Long, Int).
> 2, For graphs with simple CCs, it may be slower than the old one, because it 
> needs an extra degree computation.
> The API is the same as ConnectedComponents:
> val graph = ...
> val cc = graph.ConnectedComponentsWithDegree(100)
> or
> val cc = ConnectedComponentsWithDegree.run(graph, 100)





[jira] [Created] (SPARK-13712) Add OneVsOne to ML

2016-03-06 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13712:


 Summary: Add OneVsOne to ML
 Key: SPARK-13712
 URL: https://issues.apache.org/jira/browse/SPARK-13712
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: zhengruifeng
Priority: Minor


Another meta-method for multi-class classification.

Most classification algorithms were designed for balanced data.
The OneVsRest method trains K models on imbalanced data, while OneVsOne trains 
K*(K-1)/2 models on balanced data.

OneVsOne is less sensitive to the problems of imbalanced datasets and can 
usually achieve higher precision, but it is much more computationally 
expensive, although each model is trained on a much smaller dataset 
(2/K of the total).

OneVsOne is implemented in the same way as OneVsRest:

val classifier = new LogisticRegression()
val ovo = new OneVsOne()
ovo.setClassifier(classifier)
val ovoModel = ovo.fit(data)
val predictions = ovoModel.transform(data)
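The pairwise-training-plus-voting scheme behind OneVsOne can be sketched in plain Python with a toy 1-D nearest-class-mean base classifier standing in for, e.g., LogisticRegression (all names here are illustrative, not the proposed ML API):

```python
from itertools import combinations

def train_pair(xs, ys, a, b):
    """Fit a binary rule separating classes a and b by their means."""
    ma = sum(x for x, y in zip(xs, ys) if y == a) / ys.count(a)
    mb = sum(x for x, y in zip(xs, ys) if y == b) / ys.count(b)
    return lambda x: a if abs(x - ma) <= abs(x - mb) else b

def fit_ovo(xs, ys):
    classes = sorted(set(ys))
    # K*(K-1)/2 binary models, one per unordered class pair
    return [train_pair(xs, ys, a, b) for a, b in combinations(classes, 2)]

def predict_ovo(models, x):
    votes = [m(x) for m in models]           # each pairwise model votes
    return max(set(votes), key=votes.count)  # majority vote wins
```

With 3 classes this fits 3 pairwise models, each trained on only the two classes involved, which is the 2/K-of-total property noted above.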






[jira] [Created] (SPARK-14352) approxQuantile should support multi columns

2016-04-03 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14352:


 Summary: approxQuantile should support multi columns
 Key: SPARK-14352
 URL: https://issues.apache.org/jira/browse/SPARK-14352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: zhengruifeng


It would be convenient and efficient to compute quantiles of multiple columns 
with approxQuantile.
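For intuition, here is a toy in-memory multi-column quantile helper using exact nearest-rank lookup (Spark's approxQuantile instead uses the Greenwald-Khanna sketch to stay approximate and scalable; the names below are hypothetical):

```python
def quantiles(rows, cols, probs):
    """rows: list of dicts; cols: column names; probs: values in [0, 1].

    Returns {col: [value at each prob]} via nearest-rank lookup.
    """
    out = {}
    for c in cols:
        vals = sorted(r[c] for r in rows)   # one sorted pass per column
        out[c] = [vals[min(int(p * len(vals)), len(vals) - 1)]
                  for p in probs]
    return out
```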






[jira] [Created] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood

2016-03-30 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14272:


 Summary: Evaluate GaussianMixtureModel with LogLikelihood
 Key: SPARK-14272
 URL: https://issues.apache.org/jira/browse/SPARK-14272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: zhengruifeng
Priority: Minor


GMM uses EM to maximize the likelihood of the data, so the log-likelihood can 
be a useful metric for evaluating a GaussianMixtureModel.
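The proposed metric can be sketched for the 1-D case in plain Python (an illustrative stand-in for the MLlib model, which is multivariate):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def avg_log_likelihood(data, weights, mus, sigmas):
    """Mean of log p(x) under the weighted Gaussian mixture density."""
    total = 0.0
    for x in data:
        px = sum(w * gaussian_pdf(x, m, s)
                 for w, m, s in zip(weights, mus, sigmas))
        total += math.log(px)
    return total / len(data)
```

A higher average log-likelihood on held-out data indicates a better-fitting mixture, which is what makes it usable as an evaluation metric.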






[jira] [Created] (SPARK-14339) Add python examples for DCT,MinMaxScaler,MaxAbsScaler

2016-04-01 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-14339:


 Summary: Add python examples for DCT,MinMaxScaler,MaxAbsScaler
 Key: SPARK-14339
 URL: https://issues.apache.org/jira/browse/SPARK-14339
 Project: Spark
  Issue Type: Improvement
Reporter: zhengruifeng
Priority: Minor


Add the three missing Python examples (DCT, MinMaxScaler, MaxAbsScaler).
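For reference, the two rescalers named above reduce to simple per-feature formulas; here they are sketched in plain Python on 1-D lists (hypothetical helpers, not the ML feature API):

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Rescale to [lo, hi]: (x - min) / (max - min) * (hi - lo) + lo."""
    mn, mx = min(xs), max(xs)
    return [(x - mn) / (mx - mn) * (hi - lo) + lo for x in xs]

def max_abs_scale(xs):
    """Rescale to [-1, 1] by dividing by the maximum absolute value."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]
```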





