[jira] [Commented] (SPARK-7008) An Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504114#comment-14504114 ] zhengruifeng commented on SPARK-7008: - Thanks for this information! An Implement of Factorization Machine (LibFM) - Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Summary: An implementation of Factorization Machine (LibFM) (was: An Implement of Factorization Machine (LibFM)) An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) An Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Description: An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf was: An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf Summary: An Implement of Factorization Machine (LibFM) (was: Implement of Factorization Machine (LibFM)) An Implement of Factorization Machine (LibFM) - Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Description: An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf was: An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM work well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhengruifeng Labels: features An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Labels: features patch (was: features) Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Affects Version/s: 1.3.2 1.3.1 Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Target Version/s: 1.3.0, 1.3.1, 1.3.2 (was: 1.3.0) Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Created] (SPARK-7008) Implement of Factorization Machine (LibFM)
zhengruifeng created SPARK-7008: --- Summary: Implement of Factorization Machine (LibFM) Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhengruifeng An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FMs work well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504596#comment-14504596 ] zhengruifeng commented on SPARK-7008: - I had not considered the size of the model, because the problems I usually encounter have dimensionality below 10 million. For higher dimensionality, I think feature hashing may help to limit the number of features (not sure). LibFM implements four training algorithms: SGD, adaptive SGD, ALS and MCMC. I have only implemented SGD for regression, and I plan to implement SGD for binary classification. In my opinion, SGD is sensitive to the learning rate: large values cause divergence, while small values cause slow training. When coding, I followed LibFM closely. There are only two differences: LibFM uses plain SGD, while I use the mini-batch SGD provided by MLlib; LibFM uses a constant learning rate, while I make it decrease with the square root of the iteration counter. So I expect its convergence to be similar to LibFM's SGD. I'm testing the library, and the results will be posted in several days. Thanks. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. 
Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
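For readers of the comment above, the FM prediction rule from Rendle (2010) and the square-root learning-rate decay can be sketched in plain Scala. This is an illustrative sketch only, not the API of the proposed patch; all names (FMSketch, predict, learningRate) are assumptions.

```scala
object FMSketch {
  // y(x) = w0 + sum_i w_i * x_i + sum_{i<j} <v_i, v_j> x_i x_j,
  // with the pairwise term computed in O(k*n) via the identity
  // 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i (v_{i,f} x_i)^2 ].
  def predict(x: Array[Double], w0: Double, w: Array[Double],
              v: Array[Array[Double]]): Double = {
    val linear = x.indices.map(i => w(i) * x(i)).sum
    val k = v.head.length
    var pairwise = 0.0
    for (f <- 0 until k) {
      var s = 0.0   // sum of v(i)(f) * x(i)
      var s2 = 0.0  // sum of squares of the same terms
      for (i <- x.indices) {
        val t = v(i)(f) * x(i)
        s += t
        s2 += t * t
      }
      pairwise += 0.5 * (s * s - s2)
    }
    w0 + linear + pairwise
  }

  // Learning rate decreasing with the square root of the iteration counter,
  // as described in the comment above.
  def learningRate(eta0: Double, iter: Int): Double =
    eta0 / math.sqrt(iter.toDouble)
}
```

For example, with k = 1, x = (1, 2), w0 = 0.1, w = (0.2, 0.3) and v = ((1.0), (0.5)), the pairwise term is <v1, v2> * x1 * x2 = 1.0 and predict returns 0.1 + 0.8 + 1.0 = 1.9.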
[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Description: An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf was: An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110 ] zhengruifeng edited comment on SPARK-7008 at 4/25/15 12:46 AM: --- The convergence curves of Binary Classification are plotted in the attached FM_CR.xlsx. https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 is used, and both SGD and LBFGS are tested. The package has been submitted to spark-packages.org: http://spark-packages.org/package/zhengruifeng/spark-libFM was (Author: podongfeng): The convergence curves of Binary Classification are plotted in the attached FM_CR.xlsx. https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 is used, and both SGD and LBFGS are tested. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110 ] zhengruifeng edited comment on SPARK-7008 at 4/25/15 12:44 AM: --- The convergence curves of Binary Classification are plotted in the attached FM_CR.xlsx. https://issues.apache.org/jira/secure/attachment/12728105/FM_CR.xlsx http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 is used, and both SGD and LBFGS are tested. was (Author: podongfeng): The convergence curves of Binary Classification are plotted in the attached FM_CR.xlsx. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 is used, and both SGD and LBFGS are tested. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512110#comment-14512110 ] zhengruifeng commented on SPARK-7008: - The convergence curves of Binary Classification are plotted in the attached FM_CR.xlsx. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/url_combined.bz2 is used, and both SGD and LBFGS are tested. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Updated] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Attachment: FM_CR.xlsx An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513780#comment-14513780 ] zhengruifeng commented on SPARK-7008: - AdaGrad works pretty well in practice, but I think there should be another issue to add it to MLlib as a new Optimizer for general usage. And in my humble opinion, it may be better to avoid binding new algorithms to a specific Optimizer. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
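For reference, the per-coordinate AdaGrad update discussed in the comment above can be sketched as follows. This is a minimal illustration of the standard AdaGrad rule (Duchi et al.), not an MLlib Optimizer; the object and method names are hypothetical.

```scala
object AdaGradSketch {
  // One AdaGrad step: accumulate squared gradients per coordinate in `hist`,
  // then scale each coordinate's step by 1 / sqrt(accumulated history).
  // Mutates `w` and `hist` in place.
  def step(w: Array[Double], g: Array[Double], hist: Array[Double],
           eta: Double, eps: Double = 1e-8): Unit = {
    for (i <- w.indices) {
      hist(i) += g(i) * g(i)
      w(i) -= eta * g(i) / (math.sqrt(hist(i)) + eps)
    }
  }
}
```

Coordinates that repeatedly see large gradients get smaller effective steps, which is why AdaGrad needs no hand-tuned decay schedule; that adaptivity is the "works pretty well in practice" the comment refers to.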
[jira] [Closed] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-7008. --- Resolution: Fixed An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Commented] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length
[ https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996174#comment-14996174 ] zhengruifeng commented on SPARK-11585: -- I have implemented it based on Apriori's rule-generation algorithm: https://github.com/zhengruifeng/spark-rules It is compatible with fpm's APIs.

import org.apache.spark.mllib.fpm._
val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)
val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(1).setNumPartitions(10)
val results = ar.run(model.freqItemsets)

> AssociationRules should generate all association rules with consequents of
> arbitrary length
> 
> Key: SPARK-11585
> URL: https://issues.apache.org/jira/browse/SPARK-11585
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: zhengruifeng
>
> AssociationRules should generate all association rules with consequents of
> arbitrary length, not just rules which have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...
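The rule generation above rests on a simple identity: the confidence of a rule A ==> B is supp(A ∪ B) / supp(A), so candidate rules can be scored from the frequent-itemset supports alone. A minimal plain-Scala sketch (illustrative names; this is not the spark-rules or MLlib API):

```scala
object RuleSketch {
  // conf(A ==> B) = supp(A ∪ B) / supp(A).
  // `support` maps each frequent itemset to its support count;
  // both the antecedent and the union must be present in the map.
  def confidence(support: Map[Set[Int], Long],
                 antecedent: Set[Int], consequent: Set[Int]): Double =
    support(antecedent ++ consequent).toDouble / support(antecedent).toDouble

  // Keep only candidate rules that meet a minimum confidence threshold,
  // returning each kept rule with its confidence.
  def filterRules(support: Map[Set[Int], Long], minConfidence: Double,
                  candidates: Seq[(Set[Int], Set[Int])]): Seq[(Set[Int], Set[Int], Double)] =
    candidates
      .map { case (a, c) => (a, c, confidence(support, a, c)) }
      .filter(_._3 >= minConfidence)
}
```

For example, with supp({1}) = 10 and supp({1, 2}) = 7, the rule {1} ==> {2} has confidence 0.7, so it survives minConfidence = 0.1 but not 0.8. Apriori's rule generation grows consequents level by level, pruning candidates whose confidence falls below the threshold, which matches the shrinking candidate counts in the log output below.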
[jira] [Comment Edited] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length
[ https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996174#comment-14996174 ] zhengruifeng edited comment on SPARK-11585 at 11/9/15 8:11 AM: --- I have implemented it based on Apriori's rule-generation algorithm: https://github.com/zhengruifeng/spark-rules It is compatible with fpm's APIs.

import org.apache.spark.mllib.fpm._
val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)
val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15).setNumPartitions(10)
val results = ar.run(model.freqItemsets)

and it outputs rule-generation information like this:

15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 312917
15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 707747
15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 1020253
15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 972225
15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 653749
15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 331038
15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 138490
15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0

was (Author: podongfeng): I have implemented it based on Apriori's rule-generation algorithm: https://github.com/zhengruifeng/spark-rules It is compatible with fpm's APIs.

import org.apache.spark.mllib.fpm._
val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
val transactions = data.map(s => s.trim.split(' ').map(_.toInt)).persist()
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)
val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(1).setNumPartitions(10)
val results = ar.run(model.freqItemsets)

> AssociationRules should generate all association rules with consequents of
> arbitrary length
> 
> Key: SPARK-11585
> URL: https://issues.apache.org/jira/browse/SPARK-11585
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: zhengruifeng
>
> AssociationRules should generate all association rules with consequents of
> arbitrary length, not just rules which have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...
[jira] [Created] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length
zhengruifeng created SPARK-11585: Summary: AssociationRules should generate all association rules with consequents of arbitrary length Key: SPARK-11585 URL: https://issues.apache.org/jira/browse/SPARK-11585 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: zhengruifeng AssociationRules should generate all association rules with consequents of arbitrary length, not just rules which have a single item as the consequent. Such as: 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874 ...
[jira] [Updated] (SPARK-11585) AssociationRules should generate all association rules with consequents of arbitrary length
[ https://issues.apache.org/jira/browse/SPARK-11585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-11585: - Attachment: rule-generation.pdf Apriori's Rule Generation Algorithm > AssociationRules should generate all association rules with consequents of > arbitrary length > > > Key: SPARK-11585 > URL: https://issues.apache.org/jira/browse/SPARK-11585 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: zhengruifeng > Attachments: rule-generation.pdf > > > AssociationRules should generate all association rules with consequents of > arbitrary length, not just rules which have a single item as the consequent. > Such as: > 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117 > 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939 > 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007 > 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418 > 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997 > 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276 > 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874 > ...
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621830#comment-14621830 ] zhengruifeng commented on SPARK-7008: - Yes, LBFGS provides a faster convergence rate. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Reporter: zhengruifeng Labels: features Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a machine learning algorithm for multi-linear regression, and is widely used for recommendation. FM has performed well in recent recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
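For reference, the FM model from the linked Rendle (2010) paper predicts ŷ(x) = w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ, and Rendle's identity lets the pairwise term be computed in O(kn) instead of O(kn²). A NumPy sketch of the prediction only (illustrative; this is not the attached Spark implementation, and the parameter names are assumptions):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machine prediction.

    x: (n,) feature vector; w0: bias; w: (n,) linear weights;
    V: (n, k) latent factor matrix, row i is v_i.

    Uses Rendle's O(k*n) identity:
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    """
    xv = V.T @ x                   # (k,): per-factor weighted sums
    x2v2 = (V ** 2).T @ (x ** 2)   # (k,): per-factor sums of squares
    pairwise = 0.5 * float(np.sum(xv ** 2 - x2v2))
    return float(w0 + w @ x + pairwise)
```

A quick sanity check is to compare against the brute-force double loop over all i < j pairs; the two agree to floating-point precision.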
[jira] [Created] (SPARK-15770) 'Experimental' annotation audit
zhengruifeng created SPARK-15770: Summary: 'Experimental' annotation audit Key: SPARK-15770 URL: https://issues.apache.org/jira/browse/SPARK-15770 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Trivial 1, remove comments {:: Experimental ::} for non-experimental API 2, add comments {:: Experimental ::} for experimental API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15770) 'Experimental' annotation audit
[ https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15770: - Description: 1, remove comments {{:: Experimental ::}} for non-experimental API 2, add comments {{:: Experimental ::}} for experimental API was: 1, remove comments {:: Experimental ::} for non-experimental API 2, add comments {:: Experimental ::} for experimental API > 'Experimental' annotation audit > --- > > Key: SPARK-15770 > URL: https://issues.apache.org/jira/browse/SPARK-15770 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Trivial > > 1, remove comments {{:: Experimental ::}} for non-experimental API > 2, add comments {{:: Experimental ::}} for experimental API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15770: - Description: 1, remove comments {{:: Experimental ::}} for non-experimental API 2, add comments {{:: Experimental ::}} for experimental API 3, add comments {{:: Experimental ::}} for experimental API was: 1, remove comments {{:: Experimental ::}} for non-experimental API 2, add comments {{:: Experimental ::}} for experimental API > annotation audit for Experimental and DeveloperApi > -- > > Key: SPARK-15770 > URL: https://issues.apache.org/jira/browse/SPARK-15770 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Trivial > > 1, remove comments {{:: Experimental ::}} for non-experimental API > 2, add comments {{:: Experimental ::}} for experimental API > 3, add comments {{:: Experimental ::}} for experimental API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15770: - Description: 1, remove comments {{:: Experimental ::}} for non-experimental API 2, add comments {{:: Experimental ::}} for experimental API 3, add comments {{:: DeveloperApi ::}} for developerApi API was: 1, remove comments {{:: Experimental ::}} for non-experimental API 2, add comments {{:: Experimental ::}} for experimental API 3, add comments {{:: Experimental ::}} for experimental API > annotation audit for Experimental and DeveloperApi > -- > > Key: SPARK-15770 > URL: https://issues.apache.org/jira/browse/SPARK-15770 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Trivial > > 1, remove comments {{:: Experimental ::}} for non-experimental API > 2, add comments {{:: Experimental ::}} for experimental API > 3, add comments {{:: DeveloperApi ::}} for developerApi API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15770) annotation audit for Experimental and DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15770: - Summary: annotation audit for Experimental and DeveloperApi (was: 'Experimental' annotation audit) > annotation audit for Experimental and DeveloperApi > -- > > Key: SPARK-15770 > URL: https://issues.apache.org/jira/browse/SPARK-15770 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Trivial > > 1, remove comments {{:: Experimental ::}} for non-experimental API > 2, add comments {{:: Experimental ::}} for experimental API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322308#comment-15322308 ] zhengruifeng edited comment on SPARK-15823 at 6/9/16 10:20 AM: --- {{MulticlassMetrics.confusionMatrix}} may need {{@property}} too, but I am not sure. Others seem ok. was (Author: podongfeng): {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure. Others seem ok. > Add @property for 'accuracy' in MulticlassMetrics > - > > Key: SPARK-15823 > URL: https://issues.apache.org/jira/browse/SPARK-15823 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Minor > > 'accuracy' should be decorated with `@property` to keep step with other > methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, > `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322309#comment-15322309 ] zhengruifeng commented on SPARK-15823: -- {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure. Others seem ok. > Add @property for 'accuracy' in MulticlassMetrics > - > > Key: SPARK-15823 > URL: https://issues.apache.org/jira/browse/SPARK-15823 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Minor > > 'accuracy' should be decorated with `@property` to keep step with other > methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, > `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15823: - Comment: was deleted (was: {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure. Others seem ok.) > Add @property for 'accuracy' in MulticlassMetrics > - > > Key: SPARK-15823 > URL: https://issues.apache.org/jira/browse/SPARK-15823 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Minor > > 'accuracy' should be decorated with `@property` to keep step with other > methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, > `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322308#comment-15322308 ] zhengruifeng commented on SPARK-15823: -- {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure. Others seem ok. > Add @property for 'accuracy' in MulticlassMetrics > - > > Key: SPARK-15823 > URL: https://issues.apache.org/jira/browse/SPARK-15823 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Minor > > 'accuracy' should be decorated with `@property` to keep step with other > methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, > `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15823: - Summary: Add @property for 'accuracy' in MulticlassMetrics (was: Add @property for 'property' in MulticlassMetrics) > Add @property for 'accuracy' in MulticlassMetrics > - > > Key: SPARK-15823 > URL: https://issues.apache.org/jira/browse/SPARK-15823 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Minor > > 'accuracy' should be decorated with `@property` to keep step with other > methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, > `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15823) Add @property for 'property' in MulticlassMetrics
zhengruifeng created SPARK-15823: Summary: Add @property for 'property' in MulticlassMetrics Key: SPARK-15823 URL: https://issues.apache.org/jira/browse/SPARK-15823 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: zhengruifeng Priority: Minor 'accuracy' should be decorated with `@property` to keep step with other methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, `weightedRecall`, etc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
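The change requested in SPARK-15823 is just Python's built-in `@property` decorator, which lets `accuracy` be read as an attribute (`metrics.accuracy`) instead of a method call, matching `weightedPrecision` and `weightedRecall`. A toy illustration (a stand-in class, not the real `pyspark.mllib.evaluation.MulticlassMetrics`):

```python
class ToyMulticlassMetrics:
    """Stand-in for illustration only, not the actual PySpark class."""

    def __init__(self, prediction_label_pairs):
        self._pairs = list(prediction_label_pairs)

    @property
    def accuracy(self):
        # fraction of (prediction, label) pairs that match
        correct = sum(1 for pred, label in self._pairs if pred == label)
        return correct / len(self._pairs)

metrics = ToyMulticlassMetrics([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)])
print(metrics.accuracy)  # attribute access: no parentheses, consistent with the other metrics
```

Without the decorator, callers would have to write `metrics.accuracy()`, which is the inconsistency the ticket points out.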
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305089#comment-15305089 ] zhengruifeng commented on SPARK-15617: -- I can work on this > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305086#comment-15305086 ] zhengruifeng commented on SPARK-15617: -- Revolutions (http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html#micro) also calls it `Micro-averaged Metrics` > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
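"Micro" averaging pools true positives, false positives, and false negatives over all classes before computing precision and recall. For single-label multiclass classification every misprediction is simultaneously one false positive and one false negative, so micro precision, micro recall, micro F1, and plain accuracy all coincide, which is the ambiguity this ticket asks to be documented. A small sketch of the definition (an assumed helper, not Spark's MulticlassMetrics):

```python
def micro_f1(preds, labels):
    """Micro-averaged F1: pool TP/FP/FN across all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(preds) | set(labels):
        for p, l in zip(preds, labels):
            if p == c and l == c:
                tp += 1
            elif p == c:   # predicted c, true label differs: false positive for c
                fp += 1
            elif l == c:   # true label c, predicted otherwise: false negative for c
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds = [0, 1, 1, 2]
labels = [0, 1, 2, 2]
# each error contributes one FP and one FN, so here
# micro precision == micro recall == micro F1 == accuracy
```

This matches scikit-learn's `f1_score(..., average='micro')` behavior for single-label problems.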
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305239#comment-15305239 ] zhengruifeng commented on SPARK-15581: -- In regard to gbt, xgboost4j may be involved > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. 
Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. 
Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity > with the Scala/Java API. You can find a [complete list here| >
[jira] [Created] (SPARK-15939) Clarify ml.linalg usage
zhengruifeng created SPARK-15939: Summary: Clarify ml.linalg usage Key: SPARK-15939 URL: https://issues.apache.org/jira/browse/SPARK-15939 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: zhengruifeng Priority: Trivial 1, update comments in {pyspark.ml} that it use {ml.linalg} not {mllib.linalg} 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15939) Clarify ml.linalg usage
[ https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15939: - Description: 1, update comments in {{pyspark.ml}} that it use {{ml.linalg}} not {{mllib.linalg}} 2, rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}} was: 1, update comments in {{pyspark.ml}} that it use {ml.linalg} not {mllib.linalg} 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py} > Clarify ml.linalg usage > --- > > Key: SPARK-15939 > URL: https://issues.apache.org/jira/browse/SPARK-15939 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Trivial > > 1, update comments in {{pyspark.ml}} that it use {{ml.linalg}} not > {{mllib.linalg}} > 2, rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15939) Clarify ml.linalg usage
[ https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15939: - Description: 1, update comments in {{pyspark.ml}} that it use {ml.linalg} not {mllib.linalg} 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py} was: 1, update comments in {pyspark.ml} that it use {ml.linalg} not {mllib.linalg} 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py} > Clarify ml.linalg usage > --- > > Key: SPARK-15939 > URL: https://issues.apache.org/jira/browse/SPARK-15939 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Trivial > > 1, update comments in {{pyspark.ml}} that it use {ml.linalg} not > {mllib.linalg} > 2, rename {MLlibTestCase} to {MLTestCase} in {ml.tests.py} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15650) Add correctness test for MulticlassClassification
zhengruifeng created SPARK-15650: Summary: Add correctness test for MulticlassClassification Key: SPARK-15650 URL: https://issues.apache.org/jira/browse/SPARK-15650 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Minor {{BinaryClassificationEvaluatorSuite}} and {{RegressionEvaluatorSuite}} have tests for correctness checking, while {{MulticlassClassificationEvaluatorSuite}} does not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15614) ml.feature should support default value of input column
[ https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306419#comment-15306419 ] zhengruifeng commented on SPARK-15614: -- Agreed. What about setting the default value of {{setInputCol}} if the algorithm takes features as input? > ml.feature should support default value of input column > --- > > Key: SPARK-15614 > URL: https://issues.apache.org/jira/browse/SPARK-15614 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the default > input column, while {{ml.feature}} uses the {{setInputCol}} method to set the > input column and has no default value, which is somewhat strange. > It may be nice to support the default input column "features" in {{ml.feature}}, > and we can make these implementations extend {{HasFeaturesCol}} and make the > existing {{setInputCol}} method just an alias. > I can work on this if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15650) Add correctness test for MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-15650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15650: - Summary: Add correctness test for MulticlassClassificationEvaluator (was: Add correctness test for MulticlassClassification) > Add correctness test for MulticlassClassificationEvaluator > -- > > Key: SPARK-15650 > URL: https://issues.apache.org/jira/browse/SPARK-15650 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{BinaryClassificationEvaluatorSuite}} and {{RegressionEvaluatorSuite}} have > tests for correctness checking, while > {{MulticlassClassificationEvaluatorSuite}} does not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15291) Remove redundant codes in SVD++
[ https://issues.apache.org/jira/browse/SPARK-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15291. Resolution: Won't Fix > Remove redundant codes in SVD++ > --- > > Key: SPARK-15291 > URL: https://issues.apache.org/jira/browse/SPARK-15291 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: zhengruifeng >Priority: Minor > > {code} > val newVertices = g.vertices.mapValues(v => (v._1.toArray, v._2.toArray, > v._3, v._4)) > (Graph(newVertices, g.edges), u) > {code} > is just the same as > {code} > (g, u) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15607. Resolution: Won't Fix > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Summary: update error message for k in pca (was: PCA should not support k == numFeatures) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it support {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) PCA should not support k == numFeatures
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Priority: Minor (was: Major) > PCA should not support k == numFeatures > --- > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it support {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Description: error message for {{k}} should match the bound (was: Vector size must be greater than {{k}}, but now it support {{k == vector.size}}) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15607) Remove redundant toArray in ml.linalg
zhengruifeng created SPARK-15607: Summary: Remove redundant toArray in ml.linalg Key: SPARK-15607 URL: https://issues.apache.org/jira/browse/SPARK-15607 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Minor {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15614) ml.feature should support default value of input column
[ https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15614: - Priority: Minor (was: Major) > ml.feature should support default value of input column > --- > > Key: SPARK-15614 > URL: https://issues.apache.org/jira/browse/SPARK-15614 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{ml.clasification}} and {{ml.clustering}} use {{"features"}} as default > input column. While {{ml.feature}} use {{setInputCol}} method to set input > column and don't have default value, which is somewhat strange. > It may be nice to support default input column "features" in {{ml.feature}}, > and we can make these implements extends {{HasFeaturesCol}} and make existing > {{setInputCol}} method just a alias. > I can work on this if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15614) ml.feature should support default value of input column
zhengruifeng created SPARK-15614: Summary: ml.feature should support default value of input column Key: SPARK-15614 URL: https://issues.apache.org/jira/browse/SPARK-15614 Project: Spark Issue Type: Brainstorming Components: ML Reporter: zhengruifeng {{ml.classification}} and {{ml.clustering}} use {{"features"}} as the default input column, while {{ml.feature}} uses the {{setInputCol}} method to set the input column and has no default value, which is somewhat strange. It may be nice to support the default input column "features" in {{ml.feature}}; we can make these implementations extend {{HasFeaturesCol}} and make the existing {{setInputCol}} method just an alias. I can work on this if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
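The proposal mirrors Spark ML's shared-param pattern: a mixin supplies the param with a default value, and the existing setter survives as an alias. A rough Python sketch of that idea (simplified stand-ins; this deliberately omits the real pyspark.ml `Param`/`Params` machinery, and the class names are only illustrative):

```python
class HasFeaturesCol:
    """Simplified stand-in for a shared-param mixin with a default value."""

    def __init__(self):
        self.featuresCol = "features"  # default input column

    def setFeaturesCol(self, value):
        self.featuresCol = value
        return self  # chainable, as Spark setters are


class MyFeatureTransformer(HasFeaturesCol):
    # keep the existing setInputCol as an alias for backward compatibility
    def setInputCol(self, value):
        return self.setFeaturesCol(value)


t = MyFeatureTransformer()
print(t.featuresCol)     # defaults to "features" without any setter call
t.setInputCol("scaled")  # the old API still works
print(t.featuresCol)
```

With a default in place, pipelines whose upstream stage emits a "features" column would no longer need an explicit `setInputCol` call.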
[jira] [Commented] (SPARK-15614) ml.feature should support default value of input column
[ https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303971#comment-15303971 ] zhengruifeng commented on SPARK-15614: -- [~josephkb] [~mengxr] [~yanboliang] any thoughts? > ml.feature should support default value of input column > --- > > Key: SPARK-15614 > URL: https://issues.apache.org/jira/browse/SPARK-15614 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{ml.clasification}} and {{ml.clustering}} use {{"features"}} as default > input column. While {{ml.feature}} use {{setInputCol}} method to set input > column and don't have default value, which is somewhat strange. > It may be nice to support default input column "features" in {{ml.feature}}, > and we can make these implements extends {{HasFeaturesCol}} and make existing > {{setInputCol}} method just a alias. > I can work on this if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15614) ml.feature should support default value of input column
[ https://issues.apache.org/jira/browse/SPARK-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303971#comment-15303971 ] zhengruifeng edited comment on SPARK-15614 at 5/27/16 11:54 AM: [~josephkb] [~mengxr] [~yanboliang] [~mlnick] any thoughts? was (Author: podongfeng): [~josephkb] [~mengxr] [~yanboliang] any thoughts? > ml.feature should support default value of input column > --- > > Key: SPARK-15614 > URL: https://issues.apache.org/jira/browse/SPARK-15614 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{ml.clasification}} and {{ml.clustering}} use {{"features"}} as default > input column. While {{ml.feature}} use {{setInputCol}} method to set input > column and don't have default value, which is somewhat strange. > It may be nice to support default input column "features" in {{ml.feature}}, > and we can make these implements extends {{HasFeaturesCol}} and make existing > {{setInputCol}} method just a alias. > I can work on this if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15610) PCA should not support k == numFeatures
zhengruifeng created SPARK-15610: Summary: PCA should not support k == numFeatures Key: SPARK-15610 URL: https://issues.apache.org/jira/browse/SPARK-15610 Project: Spark Issue Type: Bug Components: ML Reporter: zhengruifeng Vector size must be greater than {{k}}, but it currently supports {{k == vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311752#comment-15311752 ] zhengruifeng commented on SPARK-15617: -- Agreed. In {{MulticlassClassificationEvaluator}}, I will remove precision/recall but keep f1 (the weighted-averaged f1-measure, which is not equal to accuracy). For {{MulticlassMetrics}}, I will just update the user guide. Is this OK? > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
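The underlying point of SPARK-15617 is that micro-averaged f1 pools true/false positives across all classes, so in single-label multiclass prediction it collapses to plain accuracy, while the weighted-averaged f1 does not. A small pure-Python sketch of that identity (illustrative only, not Spark code):

```python
def micro_f1(y_true, y_pred):
    # Micro-averaging pools TP/FP/FN across classes.  In single-label
    # multiclass prediction, every error is simultaneously a false positive
    # (for the predicted class) and a false negative (for the true class),
    # so micro-precision = micro-recall = accuracy, and so is micro-F1.
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    fp = sum(t != p for t, p in zip(y_true, y_pred))
    fn = fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
assert abs(micro_f1(y_true, y_pred) - accuracy) < 1e-12
```

This is why reporting "precision" or "recall" from the micro-averaged confusion counts adds nothing over accuracy, whereas the weighted f1 kept above is a genuinely different number.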
[jira] [Created] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics
zhengruifeng created SPARK-13435: Summary: Add Weighted Cohen's kappa to MulticlassMetrics Key: SPARK-13435 URL: https://issues.apache.org/jira/browse/SPARK-13435 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Add the missing Weighted Cohen's kappa to MulticlassMetrics. Kappa is widely used in competitions and statistics. https://en.wikipedia.org/wiki/Cohen's_kappa Some usage examples: val metrics = new MulticlassMetrics(predictionAndLabels) // The default kappa value (unweighted kappa) val kappa = metrics.kappa // Three built-in weighting types ("default": unweighted, "linear": linear weighted, "quadratic": quadratic weighted) val kappa = metrics.kappa("quadratic") // User-defined weighting matrix val matrix = Matrices.dense(n, n, values) val kappa = metrics.kappa(matrix) // User-defined weighting function def getWeight(i: Int, j: Int): Double = { if (i == j) { 0.0 } else { 1.0 } } val kappa = metrics.kappa(getWeight) // equals the unweighted kappa The calculation correctness was tested on several small datasets and compared against two Python packages: sklearn and ml_metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
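For reference, the weighted kappa behind the proposed API is computed from the confusion matrix as kappa = 1 - sum(w * O) / sum(w * E), where O holds observed counts, E the counts expected from the row/column marginals, and w the weighting matrix. A minimal pure-Python sketch (illustrative names, not the proposed Scala API):

```python
def weighted_kappa(y_true, y_pred, k, weight="quadratic"):
    # Build observed confusion matrix O; expected matrix E follows
    # from the marginals: E[i][j] = row[i] * col[j] / n.
    n = len(y_true)
    O = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    row = [sum(O[i]) for i in range(k)]
    col = [sum(O[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if weight == "quadratic":
            return ((i - j) ** 2) / ((k - 1) ** 2)
        if weight == "linear":
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0  # unweighted 0/1 disagreement

    num = sum(w(i, j) * O[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * row[i] * col[j] / n
              for i in range(k) for j in range(k))
    return 1.0 - num / den
```

With the 0/1 weighting this reduces algebraically to the familiar (p_o - p_e) / (1 - p_e) form of unweighted kappa, which is an easy correctness check against sklearn.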
[jira] [Created] (SPARK-13506) Fix the wrong parameter in R code comment in AssociationRulesSuite
zhengruifeng created SPARK-13506: Summary: Fix the wrong parameter in R code comment in AssociationRulesSuite Key: SPARK-13506 URL: https://issues.apache.org/jira/browse/SPARK-13506 Project: Spark Issue Type: Bug Components: MLlib Reporter: zhengruifeng Priority: Trivial The following R snippet in AssociationRulesSuite is wrong: /* Verify results using the `R` code: transactions = as(sapply( list("r z h k p", "z y x w v u t s", "s x o n r", "x z y m t s q e", "z", "x z y r q t p"), FUN=function(x) strsplit(x," ",fixed=TRUE)), "transactions") ars = apriori(transactions, parameter = list(support = 0.0, confidence = 0.5, target="rules", minlen=2)) arsDF = as(ars, "data.frame") arsDF$support = arsDF$support * length(transactions) names(arsDF)[names(arsDF) == "support"] = "freq" > nrow(arsDF) [1] 23 > sum(arsDF$confidence == 1) [1] 23 */ The real outputs are: > nrow(arsDF) [1] 441838 > sum(arsDF$confidence == 1) [1] 441592 The parameters passed to the apriori function in the comment were wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13538) Add GaussianMixture to ML
zhengruifeng created SPARK-13538: Summary: Add GaussianMixture to ML Key: SPARK-13538 URL: https://issues.apache.org/jira/browse/SPARK-13538 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Minor Add GaussianMixture and GaussianMixtureModel to ML package -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans
zhengruifeng created SPARK-13550: Summary: Add java example for ml.clustering.BisectingKMeans Key: SPARK-13550 URL: https://issues.apache.org/jira/browse/SPARK-13550 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Trivial Add java example for ml.clustering.BisectingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13551) Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample
zhengruifeng created SPARK-13551: Summary: Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample Key: SPARK-13551 URL: https://issues.apache.org/jira/browse/SPARK-13551 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Trivial This description is wrong: /** * Java example for graph clustering using power iteration clustering (PIC). */ This for loop is meaningless: for (Vector center: model.clusterCenters()) { System.out.println(""); } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157015#comment-15157015 ] zhengruifeng commented on SPARK-13435: -- I don't think so. Recently, many competitions have used quadratic weighted kappa as the evaluation metric, such as: https://www.kaggle.com/c/diabetic-retinopathy-detection/details/evaluation https://www.kaggle.com/c/prudential-life-insurance-assessment/details/evaluation ... The unweighted kappa is very easy to compute, especially for binary classification, but the weighted one is not so obvious and causes much confusion. You can find in Kaggle's forum that many people are confused by it. > Add Weighted Cohen's kappa to MulticlassMetrics > --- > > Key: SPARK-13435 > URL: https://issues.apache.org/jira/browse/SPARK-13435 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > Add the missing Weighted Cohen's kappa to MulticlassMetrics. > Kappa is widely used in Competition and Statistics. > https://en.wikipedia.org/wiki/Cohen's_kappa > Some usage examples: > val metrics = new MulticlassMetrics(predictionAndLabels) > // The default kappa value (Unweighted kappa) > val kappa = metrics.kappa > // Three built-in weighting type ("default":unweighted, "linear":linear > weighted, "quadratic":quadratic weighted) > val kappa = metrics.kappa("quadratic") > // User-defined weighting matrix > val matrix = Matrices.dense(n, n, values) > val kappa = metrics.kappa(matrix) > // User-defined weighting function > def getWeight(i: Int, j:Int):Double = { > if (i == j) { > 0.0 > } else { > 1.0 > } > } > val kappa = metrics.kappa(getWeight) // equals to the unweighted kappa > The calculation correctness was tested on several small data, and compared to > two python's package: sklearn and ml_metrics. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths
zhengruifeng created SPARK-13385: Summary: Enable AssociationRules to generate consequents with user-defined lengths Key: SPARK-13385 URL: https://issues.apache.org/jira/browse/SPARK-13385 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.6.0 Reporter: zhengruifeng AssociationRules should generate all association rules up to a user-defined number of iterations, not just rules which have a single item as the consequent. Such as: 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874 ... I have implemented it based on Apriori's rule-generation algorithm: https://github.com/zhengruifeng/spark-rules It's compatible with fpm's APIs. import org.apache.spark.mllib.fpm._ val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat") val transactions = data.map(s => s.trim.split(' ')).persist() val fpg = new FPGrowth().setMinSupport(0.01) val model = fpg.run(transactions) val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15) val results = ar.run(model.freqItemsets) and it outputs rule-generation information like this: 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 312917 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 707747 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 1020253 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 972225 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483 15/11/04 11:32:44 INFO AprioriRules: 
Candidates for 5-consequent rules : 653749 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 331038 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 138490 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211 15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
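The quantity being iterated on above can be sketched independently of Spark: for a frequent itemset I and any proper subset C chosen as the consequent, the rule (I \ C) ==> C has confidence supp(I) / supp(I \ C). A brute-force pure-Python sketch (the linked implementation prunes candidates Apriori-style instead of enumerating every subset; names here are illustrative):

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    # `support` maps frozenset -> support count and is assumed to contain
    # every subset of `itemset`, as FP-growth/Apriori output guarantees.
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):          # consequent length 1 .. n-1
        for cons in combinations(sorted(items), r):
            cons = frozenset(cons)
            ante = items - cons
            conf = support[items] / support[ante]
            if conf >= min_conf:
                out.append((ante, cons, conf))
    return out

# Tiny demo: supports for all subsets of {a, b, c} over 6 transactions.
support = {
    frozenset({"a"}): 4, frozenset({"b"}): 3, frozenset({"c"}): 3,
    frozenset({"a", "b"}): 3, frozenset({"a", "c"}): 2,
    frozenset({"b", "c"}): 3, frozenset({"a", "b", "c"}): 2,
}
rules = rules_from_itemset({"a", "b", "c"}, support, 0.6)
```

The real issue is about doing this level by level (1-consequent, 2-consequent, ...) and pruning at each level, which is what the candidate/generated counts in the log above reflect.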
[jira] [Updated] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths
[ https://issues.apache.org/jira/browse/SPARK-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13385: - Attachment: rule-generation.pdf rule-generation algorithm > Enable AssociationRules to generate consequents with user-defined lengths > - > > Key: SPARK-13385 > URL: https://issues.apache.org/jira/browse/SPARK-13385 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: zhengruifeng > Attachments: rule-generation.pdf > > > AssociationRules should generates all association rules with user-defined > iterations, no just rules which have a single item as the consequent. > Such as: > 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117 > 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939 > 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007 > 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418 > 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997 > 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276 > 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874 > ... > I have implemented it based on Apriori's Rule-Generation Algorithm: > https://github.com/zhengruifeng/spark-rules > It's compatible with fpm's APIs. 
> import org.apache.spark.mllib.fpm._ > val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat") > val transactions = data.map(s => s.trim.split(' ')).persist() > val fpg = new FPGrowth().setMinSupport(0.01) > val model = fpg.run(transactions) > val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15) > val results = ar.run(model.freqItemsets) > and it output rule-generation infomation like this: > 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : > 312917 > 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703 > 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : > 707747 > 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000 > 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : > 1020253 > 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002 > 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : > 972225 > 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483 > 15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : > 653749 > 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993 > 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : > 331038 > 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455 > 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : > 138490 > 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260 > 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567 > 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331 > 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430 > 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925 > 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211 > 15/11/04 11:36:47 INFO AprioriRules: Generated 
10-consequent rules : 2064 > 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246 > 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219 > 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13 > 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11 > 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13386) ConnectedComponents should support maxIteration option
zhengruifeng created SPARK-13386: Summary: ConnectedComponents should support maxIteration option Key: SPARK-13386 URL: https://issues.apache.org/jira/browse/SPARK-13386 Project: Spark Issue Type: Improvement Components: GraphX Reporter: zhengruifeng Running ConnectedComponents is time-consuming on big, complex graphs. I use it on a graph with 1.7B vertices and 11B edges, where an exact result is not a must. So I think users should be able to directly control the maxIterations of this algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
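The proposed maxIterations option trades exactness for bounded runtime: stopping label propagation early yields labels that are correct within the explored radius but may leave one true component split into several. A minimal single-machine sketch of the idea (GraphX's implementation is Pregel-style message passing, not this sequential loop; names are illustrative):

```python
def connected_components(edges, num_vertices, max_iterations):
    # Each vertex starts labeled with its own id; every sweep, both
    # endpoints of an edge take the smaller of their two labels.
    # Capping the number of sweeps bounds runtime at the cost of
    # possibly returning an unconverged (over-segmented) labeling.
    label = list(range(num_vertices))
    for _ in range(max_iterations):
        changed = False
        for u, v in edges:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
        if not changed:  # converged early: result is exact
            break
    return label
```

On a path graph whose edges are listed against the propagation direction, one sweep leaves the far vertices unmerged, while enough sweeps collapse everything to a single label.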
[jira] [Created] (SPARK-13416) Add positive check for option 'numIter' in StronglyConnectedComponents
zhengruifeng created SPARK-13416: Summary: Add positive check for option 'numIter' in StronglyConnectedComponents Key: SPARK-13416 URL: https://issues.apache.org/jira/browse/SPARK-13416 Project: Spark Issue Type: Bug Components: GraphX Reporter: zhengruifeng Priority: Minor The output of StronglyConnectedComponents with numIter no greater than 1 may make no sense, so I just add a require check to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13814) Delete unnecessary imports in python examples files
zhengruifeng created SPARK-13814: Summary: Delete unnecessary imports in Python example files Key: SPARK-13814 URL: https://issues.apache.org/jira/browse/SPARK-13814 Project: Spark Issue Type: Improvement Components: PySpark Reporter: zhengruifeng Priority: Trivial Delete unnecessary imports in Python example files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13816) Add parameter checks for algorithms in Graphx
zhengruifeng created SPARK-13816: Summary: Add parameter checks for algorithms in Graphx Key: SPARK-13816 URL: https://issues.apache.org/jira/browse/SPARK-13816 Project: Spark Issue Type: Improvement Components: GraphX Reporter: zhengruifeng Priority: Trivial Add parameter checks in GraphX algorithms: maxIterations in Pregel; maxSteps in LabelPropagation; numIter, resetProb, tol in PageRank; maxIters, maxVal, minVal in SVDPlusPlus -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection
[ https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202658#comment-15202658 ] zhengruifeng commented on SPARK-14005: -- I think ease of implementation should not be a reason to ignore convenience. > Make RDD more compatible with Scala's collection > - > > Key: SPARK-14005 > URL: https://issues.apache.org/jira/browse/SPARK-14005 > Project: Spark > Issue Type: Question > Components: Spark Core >Reporter: zhengruifeng >Priority: Trivial > > How about implementing some more methods for RDD to make it more compatible > with Scala's collection? > Such as: > nonEmpty, slice, takeRight, contains, last, reverse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13970) Add Non-Negative Matrix Factorization to MLlib
zhengruifeng created SPARK-13970: Summary: Add Non-Negative Matrix Factorization to MLlib Key: SPARK-13970 URL: https://issues.apache.org/jira/browse/SPARK-13970 Project: Spark Issue Type: New Feature Components: MLlib Reporter: zhengruifeng Priority: Minor NMF finds two non-negative matrices (W, H) whose product W * H.T approximates the non-negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction. NMF is implemented in several packages: Scikit-Learn (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF) R-NMF (https://cran.r-project.org/web/packages/NMF/index.html) LibNMF (http://www.univie.ac.at/rlcta/software/) I have implemented it in MLlib according to the following papers: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf) Algorithms for Non-negative Matrix Factorization (http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf) It can be used like this: val m = 4 val n = 3 val data = Seq( (0L, Vectors.dense(0.0, 1.0, 2.0)), (1L, Vectors.dense(3.0, 4.0, 5.0)), (3L, Vectors.dense(9.0, 0.0, 1.0)) ).map(x => IndexedRow(x._1, x._2)) val A = new IndexedRowMatrix(data).toCoordinateMatrix() val k = 2 // run the NMF algorithm val r = NMF.solve(A, k, 10) val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix] >>> org.apache.spark.mllib.linalg.DenseMatrix = 1.1349295096806706 1.4423101890626953E-5 3.453054133110303 0.46312492493865615 0.0 0.0 0.3133764134585149 2.70684017255672 val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix] >>> org.apache.spark.mllib.linalg.DenseMatrix = 0.4184163313845057 3.2719352525149286 1.12188012613645 0.002939823716977737 1.456499371939653 0.18992996116069297 val R = rW.multiply(rH.transpose) >>> org.apache.spark.mllib.linalg.DenseMatrix = 0.4749202332761286 1.273254903877907 1.6530268574248572 2.9601290106732367 3.8752743120480346 5.117332475154927 0.0 0.0 0.0 8.987727592773672 0.35952840319637736 0.9705425982249293 val AD = A.toBlockMatrix().toLocalMatrix() >>> org.apache.spark.mllib.linalg.Matrix = 0.0 1.0 2.0 3.0 4.0 5.0 0.0 0.0 0.0 9.0 0.0 1.0 var loss = 0.0 for (i <- 0 until AD.numRows; j <- 0 until AD.numCols) { val diff = AD(i, j) - R(i, j) loss += diff * diff } loss >>> Double = 0.5817999580912183 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
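The multiplicative-update algorithm from the cited Lee & Seung paper can be sketched in pure Python. Note that here H is stored as k x n, i.e. the transpose of the H in the description above; this is an illustrative sketch under those assumptions, not the proposed MLlib code:

```python
import random

def nmf(X, k, iters=200, seed=0):
    # Lee & Seung multiplicative updates minimizing ||X - W*H||_F^2,
    # with W (m x k) and H (k x n) kept entrywise non-negative: each
    # update multiplies by a non-negative ratio, so non-negativity of
    # the random initialization is preserved.
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    eps = 1e-9  # guards against division by zero

    def matmul(A, B):
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(r) for r in zip(*A)]

    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        num = matmul(transpose(W), X)
        den = matmul(transpose(W), matmul(W, H))
        for i in range(k):
            for j in range(n):
                H[i][j] *= num[i][j] / (den[i][j] + eps)
        # W <- W * (X H^T) / (W H H^T)
        num = matmul(X, transpose(H))
        den = matmul(matmul(W, H), transpose(H))
        for i in range(m):
            for j in range(k):
                W[i][j] *= num[i][j] / (den[i][j] + eps)
    return W, H
```

The updates are guaranteed not to increase the Frobenius loss, which is the property the distributed MapReduce formulation in the first cited paper preserves.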
[jira] [Created] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?
zhengruifeng created SPARK-14022: Summary: What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm? Key: SPARK-14022 URL: https://issues.apache.org/jira/browse/SPARK-14022 Project: Spark Issue Type: Question Reporter: zhengruifeng Priority: Minor What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm? RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces the dimensionality by projecting the original input space onto a randomly generated matrix. It is fully scalable, and runs fast (maybe the fastest). It is implemented in sklearn (http://scikit-learn.org/stable/modules/random_projection.html) I am willing to do this, if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
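The projection itself is a single matrix multiply by a random Gaussian matrix, which is why it scales so well. A minimal pure-Python sketch of the idea (not the sklearn or any proposed Spark API; the 1/sqrt(k) scaling follows the usual Johnson-Lindenstrauss construction, and the names are illustrative):

```python
import random

def random_projection(X, target_dim, seed=0):
    # Project each row of X (n x d) onto a random Gaussian matrix
    # R (d x k), scaled by 1/sqrt(k) so that pairwise distances are
    # approximately preserved in expectation (Johnson-Lindenstrauss).
    rng = random.Random(seed)
    d, k = len(X[0]), target_dim
    R = [[rng.gauss(0, 1) / k ** 0.5 for _ in range(k)] for _ in range(d)]
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]
            for x in X]
```

Because the map is linear, projecting x + y gives the same result as summing the projections of x and y, and with a fixed seed the transform is deterministic, both useful properties for testing a distributed version.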
[jira] [Issue Comment Deleted] (SPARK-13712) Add OneVsOne to ML
[ https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13712: - Comment: was deleted (was: OK, I have closed the PR. I had also planned to implement ECC after this PR. In general, OneVsOne is slowest among the three methods, but it generate the highest accuracy. ECC is the fastest one (about log(num_class) submodels) with lowest accuracy. OneVsRest is in middle of them, both speed and accuracy. In most case, num_class is a small number, and so OneVsOne is useful. Suppose there are 3 classes, OneVsOne is even faster than OneVsRest. So I think it may be a useful choice for user.) > Add OneVsOne to ML > -- > > Key: SPARK-13712 > URL: https://issues.apache.org/jira/browse/SPARK-13712 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Another Meta method for multi-class classification. > Most classification algorithms were designed for balanced data. > The OneVsRest method will generate K models on imbalanced data. > The OneVsOne will train K*(K-1)/2 models on balanced data. > OneVsOne is less sensitive to the problems of imbalanced datasets, and can > usually result in higher precision. > But it is much more computationally expensive, although each model are > trained on a much smaller dataset. (2/K of total) > The OneVsOne is implemented in the way OneVsRest did: > val classifier = new LogisticRegression() > val ovo = new OneVsOne() > ovo.setClassifier(classifier) > val ovoModel = ovo.fit(data) > val predictions = ovoModel.transform(data) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13712) Add OneVsOne to ML
[ https://issues.apache.org/jira/browse/SPARK-13712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194554#comment-15194554 ] zhengruifeng commented on SPARK-13712: -- OK, I have closed the PR. I had also planned to implement ECC after this PR. In general, OneVsOne is the slowest among the three methods, but it generates the highest accuracy. ECC is the fastest (about log(num_class) submodels) with the lowest accuracy. OneVsRest sits between them in both speed and accuracy. In most cases, num_class is small, so OneVsOne is useful; with 3 classes, OneVsOne is even faster than OneVsRest. So I think it may be a useful choice for users. > Add OneVsOne to ML > -- > > Key: SPARK-13712 > URL: https://issues.apache.org/jira/browse/SPARK-13712 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Another Meta method for multi-class classification. > Most classification algorithms were designed for balanced data. > The OneVsRest method will generate K models on imbalanced data. > The OneVsOne will train K*(K-1)/2 models on balanced data. > OneVsOne is less sensitive to the problems of imbalanced datasets, and can > usually result in higher precision. > But it is much more computationally expensive, although each model are > trained on a much smaller dataset. (2/K of total) > The OneVsOne is implemented in the way OneVsRest did: > val classifier = new LogisticRegression() > val ovo = new OneVsOne() > ovo.setClassifier(classifier) > val ovoModel = ovo.fit(data) > val predictions = ovoModel.transform(data) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
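The aggregation step being compared here is simple majority voting: K*(K-1)/2 pairwise models each cast one vote, and the class with the most wins is the final prediction. A hypothetical pure-Python sketch (the proposed Scala API appears in the issue description; these names are illustrative):

```python
from collections import Counter

def one_vs_one_predict(x, pairwise_models):
    # `pairwise_models` maps each unordered class pair (i, j), i < j,
    # to a binary predictor trained only on classes i and j that
    # returns either i or j.  Each of the K*(K-1)/2 models casts one
    # vote; the most-voted class is the prediction.
    votes = Counter(model(x) for model in pairwise_models.values())
    return votes.most_common(1)[0][0]

# Demo with K = 3 stub models (each ignores its input):
models = {(0, 1): lambda x: 1, (0, 2): lambda x: 2, (1, 2): lambda x: 1}
prediction = one_vs_one_predict(None, models)
```

Each pairwise model sees only the rows belonging to its two classes, which is where the "balanced data, 2/K of the total" trade-off in the description comes from.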
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239140#comment-15239140 ] zhengruifeng commented on SPARK-14516: -- ok, I will work on clarify this API. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239140#comment-15239140 ] zhengruifeng edited comment on SPARK-14516 at 4/13/16 12:22 PM: ok, I will clarify this API. was (Author: podongfeng): ok, I will work on clarify this API. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > MLlib does not have any general purposed clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class of extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14510) Add args-checking for LDA and StreamingKMeans
zhengruifeng created SPARK-14510: Summary: Add args-checking for LDA and StreamingKMeans Key: SPARK-14510 URL: https://issues.apache.org/jira/browse/SPARK-14510 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Add args-checking for LDA and StreamingKMeans
[jira] [Created] (SPARK-14509) Add python CountVectorizerExample
zhengruifeng created SPARK-14509: Summary: Add python CountVectorizerExample Key: SPARK-14509 URL: https://issues.apache.org/jira/browse/SPARK-14509 Project: Spark Issue Type: Improvement Components: Documentation Reporter: zhengruifeng Priority: Minor Add the missing python example for CountVectorizer
[jira] [Updated] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?
[ https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14022: - Issue Type: Brainstorming (was: Question) > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > --- > > Key: SPARK-14022 > URL: https://issues.apache.org/jira/browse/SPARK-14022 > Project: Spark > Issue Type: Brainstorming >Reporter: zhengruifeng >Priority: Minor > > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces > the dimensionality by projecting the original input space onto a randomly > generated matrix. > It is fully scalable, and runs fast (maybe the fastest). > It is implemented in sklearn > (http://scikit-learn.org/stable/modules/random_projection.html) > I am willing to do this, if needed.
[jira] [Reopened] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?
[ https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reopened SPARK-14022: -- There needs to be some discussion on whether to add RandomProjection or not. > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > --- > > Key: SPARK-14022 > URL: https://issues.apache.org/jira/browse/SPARK-14022 > Project: Spark > Issue Type: Brainstorming >Reporter: zhengruifeng >Priority: Minor > > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces > the dimensionality by projecting the original input space onto a randomly > generated matrix. > It is fully scalable, and runs fast (maybe the fastest). > It is implemented in sklearn > (http://scikit-learn.org/stable/modules/random_projection.html) > I am willing to do this, if needed.
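For discussion purposes, the core of random projection fits in a few lines. This is a pure-Python sketch of the idea only (not the proposed MLlib API): multiply the data by a random Gaussian matrix with N(0, 1/k) entries, which approximately preserves pairwise distances by the Johnson-Lindenstrauss lemma:

```python
import random

def random_projection(X, k, seed=0):
    """Project an n x d dataset (list of lists) onto k dimensions using a
    randomly generated d x k Gaussian matrix with variance 1/k."""
    rng = random.Random(seed)
    d = len(X[0])
    R = [[rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(k)] for _ in range(d)]
    # plain matrix product X @ R
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]
            for x in X]
```

No training pass over the data is needed, which is what makes the method fully scalable: the projection matrix depends only on d, k, and a seed.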
[jira] [Created] (SPARK-14514) Add python example for VectorSlicer
zhengruifeng created SPARK-14514: Summary: Add python example for VectorSlicer Key: SPARK-14514 URL: https://issues.apache.org/jira/browse/SPARK-14514 Project: Spark Issue Type: Improvement Reporter: zhengruifeng Add the missing python example for VectorSlicer
[jira] [Created] (SPARK-14515) Add python example for ChiSqSelector
zhengruifeng created SPARK-14515: Summary: Add python example for ChiSqSelector Key: SPARK-14515 URL: https://issues.apache.org/jira/browse/SPARK-14515 Project: Spark Issue Type: Improvement Components: Documentation Reporter: zhengruifeng Add the missing python example for ChiSqSelector
[jira] [Updated] (SPARK-14514) Add python example for VectorSlicer
[ https://issues.apache.org/jira/browse/SPARK-14514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14514: - Component/s: Documentation > Add python example for VectorSlicer > --- > > Key: SPARK-14514 > URL: https://issues.apache.org/jira/browse/SPARK-14514 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng > > Add the missing python example for VectorSlicer
[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?
[ https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233937#comment-15233937 ] zhengruifeng commented on SPARK-14022: -- Ok, I changed the Type from Question to Brainstorming. I reopened this JIRA because I think it may be nice to add the RandomProjection algorithm. > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > --- > > Key: SPARK-14022 > URL: https://issues.apache.org/jira/browse/SPARK-14022 > Project: Spark > Issue Type: Brainstorming >Reporter: zhengruifeng >Priority: Minor > > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces > the dimensionality by projecting the original input space onto a randomly > generated matrix. > It is fully scalable, and runs fast (maybe the fastest). > It is implemented in sklearn > (http://scikit-learn.org/stable/modules/random_projection.html) > I am willing to do this, if needed.
[jira] [Updated] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths
[ https://issues.apache.org/jira/browse/SPARK-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13385: - Priority: Major (was: Minor) > Enable AssociationRules to generate consequents with user-defined lengths > - > > Key: SPARK-13385 > URL: https://issues.apache.org/jira/browse/SPARK-13385 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: zhengruifeng >Assignee: zhengruifeng > Attachments: rule-generation.pdf > > > AssociationRules should generate all association rules up to a user-defined > number of iterations, not just rules which have a single item as the consequent. > Such as: > 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117 > 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939 > 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007 > 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418 > 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997 > 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276 > 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874 > ... > I have implemented it based on Apriori's Rule-Generation Algorithm: > https://github.com/zhengruifeng/spark-rules > It's compatible with fpm's APIs. 
> import org.apache.spark.mllib.fpm._ > val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat") > val transactions = data.map(s => s.trim.split(' ')).persist() > val fpg = new FPGrowth().setMinSupport(0.01) > val model = fpg.run(transactions) > val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15) > val results = ar.run(model.freqItemsets) > and it outputs rule-generation information like this: > 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : > 312917 > 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703 > 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : > 707747 > 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000 > 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : > 1020253 > 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002 > 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : > 972225 > 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483 > 15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : > 653749 > 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993 > 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : > 331038 > 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455 > 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : > 138490 > 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260 > 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567 > 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331 > 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430 > 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925 > 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211 > 15/11/04 11:36:47 INFO AprioriRules: Generated 
10-consequent rules : 2064 > 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246 > 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219 > 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13 > 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11 > 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0
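The rule-generation step above can be sketched in plain Python (a simplified, non-distributed illustration, not the Spark implementation linked in the issue): every split of a frequent itemset into antecedent and consequent is scored by confidence, with consequents allowed up to a configurable length rather than just one item. By the Apriori property, every antecedent of a frequent itemset is itself frequent, so its support is available in the table:

```python
from itertools import combinations

def generate_rules(freq, min_conf=0.1, max_consequent=2):
    """freq maps frozenset(itemset) -> support count.
    Yields (antecedent, consequent, confidence) for consequents of
    length 1..max_consequent, not just single-item consequents."""
    rules = []
    for itemset, sup in freq.items():
        for k in range(1, min(max_consequent, len(itemset) - 1) + 1):
            for consequent in combinations(sorted(itemset), k):
                antecedent = frozenset(itemset - set(consequent))
                conf = sup / freq[antecedent]   # sup(A u B) / sup(A)
                if conf >= min_conf:
                    rules.append((antecedent, frozenset(consequent), conf))
    return rules
```

The real implementation prunes candidates level by level (the "Candidates for k-consequent rules" lines in the log); this sketch enumerates all splits directly for clarity.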
[jira] [Created] (SPARK-14512) Add python example for QuantileDiscretizer
zhengruifeng created SPARK-14512: Summary: Add python example for QuantileDiscretizer Key: SPARK-14512 URL: https://issues.apache.org/jira/browse/SPARK-14512 Project: Spark Issue Type: Improvement Reporter: zhengruifeng Add the missing python example for QuantileDiscretizer
[jira] [Created] (SPARK-14516) What about adding general clustering metrics?
zhengruifeng created SPARK-14516: Summary: What about adding general clustering metrics? Key: SPARK-14516 URL: https://issues.apache.org/jira/browse/SPARK-14516 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: zhengruifeng ML/MLLIB doesn't have any general-purpose clustering metrics against a ground truth. In [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), there are several kinds of metrics for this. It may be meaningful to add some clustering metrics into ML/MLLIB.
[jira] [Commented] (SPARK-14022) What about adding RandomProjection to ML/MLLIB as a new dimensionality reduction algorithm?
[ https://issues.apache.org/jira/browse/SPARK-14022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233941#comment-15233941 ] zhengruifeng commented on SPARK-14022: -- cc [~yanboliang] [~mengxr] [~josephkb] > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > --- > > Key: SPARK-14022 > URL: https://issues.apache.org/jira/browse/SPARK-14022 > Project: Spark > Issue Type: Brainstorming >Reporter: zhengruifeng >Priority: Minor > > What about adding RandomProjection to ML/MLLIB as a new dimensionality > reduction algorithm? > RandomProjection (https://en.wikipedia.org/wiki/Random_projection) reduces > the dimensionality by projecting the original input space onto a randomly > generated matrix. > It is fully scalable, and runs fast (maybe the fastest). > It is implemented in sklearn > (http://scikit-learn.org/stable/modules/random_projection.html) > I am willing to do this, if needed.
[jira] [Commented] (SPARK-14516) What about adding general clustering metrics?
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233946#comment-15233946 ] zhengruifeng commented on SPARK-14516: -- cc [~mengxr] [~josephkb] [~yanboliang] > What about adding general clustering metrics? > - > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML, MLlib >Reporter: zhengruifeng > > ML/MLLIB doesn't have any general-purpose clustering metrics against a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into ML/MLLIB.
[jira] [Created] (SPARK-14027) Add parameter check to GradientDescent
zhengruifeng created SPARK-14027: Summary: Add parameter check to GradientDescent Key: SPARK-14027 URL: https://issues.apache.org/jira/browse/SPARK-14027 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Minor The following code should throw an exception, not just run successfully and return a model: val data = MLUtils.loadLibSVMFile(sc, "/tmp/sample_libsvm_data.txt") val model = LogisticRegressionWithSGD.train(data, -2, -0.01, 0.5)
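The requested validation amounts to a guard at the top of the optimizer. A sketch of the idea (hypothetical parameter names mirroring GradientDescent's numIterations / stepSize / miniBatchFraction; the real fix would be Scala `require(...)` calls):

```python
def check_gradient_descent_params(num_iterations, step_size, mini_batch_fraction):
    """Reject obviously invalid optimizer settings instead of silently
    'training' with them, as in the LogisticRegressionWithSGD example above."""
    if num_iterations <= 0:
        raise ValueError("numIterations must be positive, got %d" % num_iterations)
    if step_size <= 0:
        raise ValueError("stepSize must be positive, got %g" % step_size)
    if not 0 < mini_batch_fraction <= 1:
        raise ValueError("miniBatchFraction must be in (0, 1], got %g"
                         % mini_batch_fraction)
```

With this check, a call like `train(data, -2, -0.01, 0.5)` fails fast with a clear message rather than returning a meaningless model.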
[jira] [Created] (SPARK-14030) Add parameter check to LBFGS
zhengruifeng created SPARK-14030: Summary: Add parameter check to LBFGS Key: SPARK-14030 URL: https://issues.apache.org/jira/browse/SPARK-14030 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Trivial Add the missing parameter verification in LBFGS
[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection
[ https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203056#comment-15203056 ] zhengruifeng commented on SPARK-14005: -- ok, please close this JIRA. > Make RDD more compatible with Scala's collection > - > > Key: SPARK-14005 > URL: https://issues.apache.org/jira/browse/SPARK-14005 > Project: Spark > Issue Type: Question > Components: Spark Core >Reporter: zhengruifeng >Priority: Trivial > > How about implementing some more methods for RDD to make it more compatible > with Scala's collections? > Such as: > nonEmpty, slice, takeRight, contains, last, reverse
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212776#comment-15212776 ] zhengruifeng commented on SPARK-14174: -- There is another sklearn example for MiniBatch KMeans: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > I have implemented mini-batch kmeans in Mllib, and the acceleration is really > significant. > The MiniBatch KMeans is named XMeans in the following lines. 
> val path = "/tmp/mnist8m.scale" > val data = MLUtils.loadLibSVMFile(sc, path) > val vecs = data.map(_.features).persist() > val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", seed=123l) > km.computeCost(vecs) > res0: Double = 3.317029898599564E8 > val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) > xm.computeCost(vecs) > res1: Double = 3.3169865959604424E8 > val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) > xm2.computeCost(vecs) > res2: Double = 3.317195831216454E8 > The above three training runs all reached the max number of iterations (10). > We can see that the WSSSEs are almost the same, while their speed performance > differs significantly: > KMeans: 2876 sec > MiniBatch KMeans (fraction=0.1): 263 sec > MiniBatch KMeans (fraction=0.01): 90 sec > With an appropriate fraction, the bigger the dataset, the higher the speedup. > The data used above has 8,100,000 samples and 784 features. It can be > downloaded here > (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
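The mini-batch update itself is simple; a pure-Python sketch of it (illustrative only, not the MLlib/XMeans code attached to the issue): each iteration samples a fraction of the points and moves each sample's nearest center toward it with a decaying per-center learning rate, as in Sculley's web-scale k-means:

```python
import random

def mini_batch_kmeans(points, k, fraction=0.5, iterations=20, seed=123):
    """Mini-batch k-means on a list of equal-length tuples/lists."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    batch = max(1, int(fraction * len(points)))
    for _ in range(iterations):
        for p in rng.sample(points, batch):
            # assign the sampled point to its nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(centers[c], p)))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate decays over time
            centers[j] = [(1 - eta) * a + eta * b for a, b in zip(centers[j], p)]
    return centers
```

Because only `fraction * n` points are touched per iteration, cost per iteration drops proportionally, which matches the 2876s / 263s / 90s timings above.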
[jira] [Created] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
zhengruifeng created SPARK-14174: Summary: Accelerate KMeans via Mini-Batch EM Key: SPARK-14174 URL: https://issues.apache.org/jira/browse/SPARK-14174 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Minor The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm. I have implemented mini-batch kmeans in Mllib, and the acceleration is really significant. The MiniBatch KMeans is named XMeans in the following lines. val path = "/tmp/mnist8m.scale" val data = MLUtils.loadLibSVMFile(sc, path) val vecs = data.map(_.features).persist() val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l) km.computeCost(vecs) res0: Double = 3.317029898599564E8 val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) xm.computeCost(vecs) res1: Double = 3.3169865959604424E8 val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) xm2.computeCost(vecs) res2: Double = 3.317195831216454E8 The above three training runs all reached the max number of iterations (10). We can see that the WSSSEs are almost the same, while their speed performance differs significantly: KMeans: 2876 sec MiniBatch KMeans (fraction=0.1): 263 sec MiniBatch KMeans (fraction=0.01): 90 sec With an appropriate fraction, the bigger the dataset, the higher the speedup. 
The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
[jira] [Created] (SPARK-14005) Make RDD more compatible with Scala's collection
zhengruifeng created SPARK-14005: Summary: Make RDD more compatible with Scala's collection Key: SPARK-14005 URL: https://issues.apache.org/jira/browse/SPARK-14005 Project: Spark Issue Type: Question Components: Spark Core Reporter: zhengruifeng Priority: Trivial How about implementing some more methods for RDD to make it more compatible with Scala's collections? Such as: nonEmpty, slice, takeRight, contains, last, reverse
[jira] [Created] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
zhengruifeng created SPARK-13677: Summary: Support Tree-Based Feature Transformation for mllib Key: SPARK-13677 URL: https://issues.apache.org/jira/browse/SPARK-13677 Project: Spark Issue Type: New Feature Reporter: zhengruifeng Priority: Minor It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is implemented in two well-known libraries: sklearn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) xgboost (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) I have implemented it in mllib: val features : RDD[Vector] = ... val model1 : RandomForestModel = ... val transformed1 : RDD[Vector] = model1.leaf(features) val model2 : GradientBoostedTreesModel = ... val transformed2 : RDD[Vector] = model2.leaf(features)
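The transformation described above can be sketched with toy trees in plain Python (illustrative only; the tree dict shape and function names here are made up, not the mllib API). Each input is routed to a leaf in every tree, and the leaf indices are one-hot encoded and concatenated:

```python
def leaf_index(tree, x):
    """Route x through a toy tree of nested dicts:
    internal nodes are {'feature': i, 'thresh': t, 'left': ..., 'right': ...},
    leaves are {'leaf': j}. Returns the leaf id reached."""
    while 'leaf' not in tree:
        tree = tree['left'] if x[tree['feature']] <= tree['thresh'] else tree['right']
    return tree['leaf']

def one_hot_leaves(trees, x, leaves_per_tree):
    """Concatenate one-hot encodings of the leaf each tree routes x to --
    the tree-based feature transformation from the Facebook ad-click paper."""
    out = []
    for tree in trees:
        vec = [0] * leaves_per_tree
        vec[leaf_index(tree, x)] = 1
        out.extend(vec)
    return out
```

The resulting sparse binary vector is what would feed a downstream linear model, mirroring `model.leaf(features)` in the Scala snippet above.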
[jira] [Created] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB
zhengruifeng created SPARK-13672: Summary: Add python examples of BisectingKMeans in ML and MLLIB Key: SPARK-13672 URL: https://issues.apache.org/jira/browse/SPARK-13672 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: zhengruifeng Priority: Trivial add the missing python examples of BisectingKMeans for ml and mllib
[jira] [Created] (SPARK-13714) Another ConnectedComponents based on Max-Degree Propagation
zhengruifeng created SPARK-13714: Summary: Another ConnectedComponents based on Max-Degree Propagation Key: SPARK-13714 URL: https://issues.apache.org/jira/browse/SPARK-13714 Project: Spark Issue Type: New Feature Components: GraphX Reporter: zhengruifeng Priority: Minor The current ConnectedComponents algorithm is based on Min-VertexId Propagation, which is sensitive to the placement of the Min-VertexId. This implementation is based on Max-Degree Propagation instead. First, the degree graph is computed, and in the Pregel process, the vertex with the max degree in a CC is the start point of propagation. This new method has advantages over the old one: 1, The convergence is only determined by the structure of the CC, and is robust to the placement of the vertex with the min ID. 2, For spherical CCs in which there may be a concept like a 'center', it can accelerate the convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), the old CC needs 4 supersteps, while the new one only needs 2 supersteps. 3, If we limit the number of iterations, the new method tends to generate more acceptable results. 4, The output for each CC is the vertex with the max degree in it, which may be more meaningful. And because the vertex ID is nominal in most cases, the vertex with the min ID in a CC is somewhat meaningless. But there are still two disadvantages: 1, The message body grows from (VID) to (VID, Degree), that is (Long) -> (Long, Int). 2, For graphs with simple CCs, it may be slower than the old one, because it needs an extra degree computation. The API is the same as ConnectedComponents: val graph = ... val cc = graph.ConnectedComponentsWithDegree(100) or val cc = ConnectedComponentsWithDegree.run(graph, 100)
[jira] [Updated] (SPARK-13714) Another ConnectedComponents based on Max-Degree Propagation
[ https://issues.apache.org/jira/browse/SPARK-13714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13714: - Description: The current ConnectedComponents algorithm is based on Min-VertexId Propagation, which is sensitive to the placement of the Min-VertexId. This implementation is based on Max-Degree Propagation instead. First, the degree graph is computed, and in the Pregel process, the vertex with the max degree in a CC is the start point of propagation. This new method has advantages over the old one: 1, The convergence is only determined by the structure of the CC, and is robust to the placement of the vertex with the min ID. 2, For spherical CCs in which there may be a concept like a 'center', it can accelerate the convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), the old CC needs 4 supersteps, while the new one only needs 2 supersteps. 3, If we limit the number of iterations, the new method tends to generate more acceptable results. 4, The output for each CC is the vertex with the max degree in it, which may be more meaningful. And because the vertex ID is nominal in most cases, the vertex with the min ID in a CC is somewhat meaningless. But there are still two disadvantages: 1, The message body grows from (VID) to (VID, Degree), that is (Long) -> (Long, Int). 2, For graphs with simple CCs, it may be slower than the old one, because it needs an extra degree computation. The API is the same as ConnectedComponents: val graph = ... val cc = graph.ConnectedComponentsWithDegree(100) or val cc = ConnectedComponentsWithDegree.run(graph, 100) was: Current ConnectedComponents algorithm was based on Min-VertexId Propagation, which is sensitive to the place of Min-VertexId. While this implementation is based on Max-Degree Propagation. First, the degree graph is computed. And in the pregel progress, the vertex with the max degree in a CC is the start point of propagation. This new method has advantages over the old one: 1, The convergence is only determined by the structs of CC, and is robust to the place of vertex with Min-ID. 2, For spherical CCs in which there may be a concept like 'center', it can accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), the old CC need 4 supersteps, while the new one only need 2 supersteps. 3, If we limit the number of iteration, the new method tend to generate more acceptable results. 4, The output for each CC is the vertex with max degree in it, which may be more meaningful. And because the vertex-ID is nominal in most cases, the vertex with min-ID in a CC is somewhat meanless. But there are still two disadvantages: 1,The message boy grows, from (VID) to (VID, Degree). that is (Long) -> (Long, Int) 2,For graph with simple CCs, it may be slower than old one. Because it need a extra degree computation. The api is the same like ConnectedComponents: val graph = ... val cc = graph.ConnectedComponentsWithDegree(100) or val cc = ConnectedComponentsWithDegree.run(graph, 100) > Another ConnectedComponents based on Max-Degree Propagation > --- > > Key: SPARK-13714 > URL: https://issues.apache.org/jira/browse/SPARK-13714 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: zhengruifeng >Priority: Minor > > The current ConnectedComponents algorithm is based on Min-VertexId Propagation, > which is sensitive to the placement of the Min-VertexId. > This implementation is based on Max-Degree Propagation instead. > First, the degree graph is computed, and in the Pregel process, the vertex > with the max degree in a CC is the start point of propagation. > This new method has advantages over the old one: > 1, The convergence is only determined by the structure of the CC, and is robust to > the placement of the vertex with the min ID. > 2, For spherical CCs in which there may be a concept like a 'center', it can > accelerate the convergence. For example, on GraphGenerators.gridGraph(sc, 3, 3), > the old CC needs 4 supersteps, while the new one only needs 2 supersteps. > 3, If we limit the number of iterations, the new method tends to generate more > acceptable results. > 4, The output for each CC is the vertex with the max degree in it, which may be > more meaningful. And because the vertex ID is nominal in most cases, the > vertex with the min ID in a CC is somewhat meaningless. > But there are still two disadvantages: > 1, The message body grows from (VID) to (VID, Degree), that is (Long) -> > (Long, Int). > 2, For graphs with simple CCs, it may be slower than the old one, because it needs an > extra degree computation. > The API is the same as ConnectedComponents: > val graph = ... > val cc = graph.ConnectedComponentsWithDegree(100) > or > val cc = ConnectedComponentsWithDegree.run(graph, 100)
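The max-degree propagation idea can be sketched outside GraphX in plain Python (illustrative only; this sweeps vertices sequentially rather than in Pregel supersteps, so iteration counts differ from the distributed version). Each vertex repeatedly takes the maximum (degree, id) pair seen among itself and its neighbors, so every component converges to the label of its highest-degree vertex:

```python
from collections import defaultdict

def connected_components_by_degree(edges):
    """Label each vertex with the id of the highest-degree vertex in its
    connected component, ties broken by the larger vertex id."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # initial label of v is its own (degree, id) pair
    label = {v: (len(adj[v]), v) for v in adj}
    changed = True
    while changed:
        changed = False
        for u in adj:
            # propagate the max (degree, id) pair from the neighborhood
            best = max([label[u]] + [label[v] for v in adj[u]])
            if best != label[u]:
                label[u] = best
                changed = True
    return {v: label[v][1] for v in adj}
```

The message carried is the (degree, id) pair, which is exactly the (Long) -> (Long, Int) growth noted as a disadvantage above.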
[jira] [Created] (SPARK-13712) Add OneVsOne to ML
zhengruifeng created SPARK-13712: Summary: Add OneVsOne to ML Key: SPARK-13712 URL: https://issues.apache.org/jira/browse/SPARK-13712 Project: Spark Issue Type: New Feature Components: ML Reporter: zhengruifeng Priority: Minor Another meta method for multi-class classification. Most classification algorithms were designed for balanced data. The OneVsRest method will generate K models on imbalanced data, while OneVsOne will train K*(K-1)/2 models on balanced data. OneVsOne is less sensitive to the problems of imbalanced datasets, and can usually result in higher precision. But it is much more computationally expensive, although each model is trained on a much smaller dataset (2/K of the total). OneVsOne is implemented the same way OneVsRest is: val classifier = new LogisticRegression() val ovo = new OneVsOne() ovo.setClassifier(classifier) val ovoModel = ovo.fit(data) val predictions = ovoModel.transform(data)
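The meta-algorithm itself is independent of the base learner. A pure-Python sketch of the scheme (illustrative only; `fit_binary` and `nearest_mean` are made-up stand-ins for the `setClassifier` hook, not the proposed ML API): one binary model is trained per pair of classes, and prediction is a majority vote over the K*(K-1)/2 models:

```python
from itertools import combinations
from collections import Counter

class OneVsOne:
    """Train one binary model per class pair; predict by majority vote.
    fit_binary is any (X, y) -> predict_fn factory."""
    def __init__(self, fit_binary):
        self.fit_binary = fit_binary
        self.models = {}

    def fit(self, X, y):
        for a, b in combinations(sorted(set(y)), 2):
            # each pairwise model only sees the two classes' data (~2/K of total)
            pairs = [(x, t) for x, t in zip(X, y) if t in (a, b)]
            Xs = [x for x, _ in pairs]
            ys = [t for _, t in pairs]
            self.models[(a, b)] = self.fit_binary(Xs, ys)
        return self

    def predict(self, x):
        votes = Counter(model(x) for model in self.models.values())
        return votes.most_common(1)[0][0]

def nearest_mean(X, y):
    """Toy binary 'classifier' on scalars: predict the class whose mean is closest."""
    means = {c: sum(x for x, t in zip(X, y) if t == c) /
                sum(1 for t in y if t == c) for c in set(y)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))
```

The per-pair filtering is what keeps each training set balanced, at the cost of a quadratic number of models.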
[jira] [Created] (SPARK-14352) approxQuantile should support multi columns
zhengruifeng created SPARK-14352: Summary: approxQuantile should support multi columns Key: SPARK-14352 URL: https://issues.apache.org/jira/browse/SPARK-14352 Project: Spark Issue Type: Improvement Components: SQL Reporter: zhengruifeng It will be convenient and efficient to calculate quantiles of multiple columns with approxQuantile.
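For reference, the requested behavior amounts to returning one list of quantiles per column. A plain-Python illustration of the semantics (exact quantiles over lists of dicts; the real approxQuantile uses the approximate Greenwald-Khanna algorithm, and this is not Spark API code):

```python
def multi_column_quantiles(rows, cols, probs):
    """Return {column: [quantile at p for p in probs]} -- the shape a
    multi-column approxQuantile would produce."""
    out = {}
    for c in cols:
        vals = sorted(r[c] for r in rows)
        # nearest-rank style index; clamp to the last element for p == 1.0
        out[c] = [vals[min(len(vals) - 1, int(p * len(vals)))] for p in probs]
    return out
```

The efficiency argument in the issue is that a multi-column variant can compute all the per-column sketches in a single pass over the data instead of one pass per column.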
[jira] [Created] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood
zhengruifeng created SPARK-14272: Summary: Evaluate GaussianMixtureModel with LogLikelihood Key: SPARK-14272 URL: https://issues.apache.org/jira/browse/SPARK-14272 Project: Spark Issue Type: Improvement Components: MLlib Reporter: zhengruifeng Priority: Minor GMM uses EM to maximize the likelihood of the data, so log-likelihood can be a useful metric to evaluate a GaussianMixtureModel.
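The proposed metric is just the sum of log mixture densities over the data. A 1-D pure-Python sketch of the computation (illustrative only, not the MLlib GaussianMixtureModel API):

```python
import math

def gmm_log_likelihood(xs, weights, means, sigmas):
    """Total log-likelihood of xs under a 1-D Gaussian mixture:
    sum over points of log( sum_k w_k * N(x | mu_k, sigma_k) )."""
    total = 0.0
    for x in xs:
        p = sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sigmas))
        total += math.log(p)
    return total
```

Since EM monotonically increases exactly this quantity, it is a natural fit: a better-fitting model scores strictly higher on the same data.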
[jira] [Created] (SPARK-14339) Add python examples for DCT,MinMaxScaler,MaxAbsScaler
zhengruifeng created SPARK-14339: Summary: Add python examples for DCT,MinMaxScaler,MaxAbsScaler Key: SPARK-14339 URL: https://issues.apache.org/jira/browse/SPARK-14339 Project: Spark Issue Type: Improvement Reporter: zhengruifeng Priority: Minor Add the three missing python examples for DCT, MinMaxScaler and MaxAbsScaler.