[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302952#comment-14302952 ] yuhao yang commented on SPARK-1405: --- Hi everyone, I'm sharing an implementation of [Online LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at https://github.com/hhbyyh/OnlineLDA_Spark, and hope it can be helpful for anyone interested. The work is based on the research of [Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online nature, the algorithm 1. scans the corpus (doc sets) only once. Thus it {quote}need not locally store or collect the documents and can be handily applied to streaming document collections.{quote} 2. breaks the massive corpus into mini-batches and takes one batch at a time, which reduces memory and time consumption. 3. approximates the posterior as well as traditional approaches do (generates comparable or better results). In demo runs, the current implementation (with many details still to be improved) 1. processed 8 million short articles (Stack Overflow post titles, avg length 9, K=10) in 15 minutes. 2. processed the entire English wiki dump (5876K documents, avg length ~900 words per doc, 30G on disk, K=10) in 2 hours and 17 minutes on a 4-node cluster (20G memory; could be much less). Trials and suggestions are most welcome! parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Joseph K. Bradley Priority: Critical Labels: features Fix For: 1.3.0 Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (already solved), a word segmentation (imported from Lucene), and a Gibbs sampling core.
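For readers curious how the one-pass, mini-batch behavior described above works, here is a minimal self-contained sketch of the online variational update from Hoffman, Blei and Bach (2010). It is not the code in the linked repository: the object name, hyperparameters (tau0, kappa, corpusSize) and the deliberately simplified one-iteration E-step are all illustrative assumptions; the real algorithm iterates per-document gamma/phi updates with digamma terms.

{code}
// Sketch of online variational Bayes for LDA (Hoffman, Blei & Bach, 2010).
// lambda(k)(w) is the variational parameter for topic k, word w.
object OnlineLdaSketch {
  val numTopics = 10
  val vocabSize = 1000
  val eta = 1.0 / numTopics      // topic-word Dirichlet prior
  val tau0 = 1024.0              // delays the influence of early, noisy batches
  val kappa = 0.7                // learning-rate decay; must be in (0.5, 1]
  val corpusSize = 100000.0      // assumed total number of documents

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(7)
    val lambda = Array.fill(numTopics, vocabSize)(rnd.nextDouble())

    // Stand-in stream of mini-batches; each doc is a sparse (wordId -> count) map.
    val miniBatches = Iterator.tabulate(20) { _ =>
      Seq.fill(32)(Map(rnd.nextInt(vocabSize) -> 3.0, rnd.nextInt(vocabSize) -> 1.0))
    }

    for ((batch, t) <- miniBatches.zipWithIndex) {
      // Simplified one-iteration E-step: split each word's count across topics
      // in proportion to the current normalized lambda. (The real algorithm
      // iterates gamma/phi updates per document until convergence.)
      val stats = Array.fill(numTopics, vocabSize)(0.0)
      val rowSums = lambda.map(_.sum)
      for (doc <- batch; (w, cnt) <- doc) {
        val weights = Array.tabulate(numTopics)(k => lambda(k)(w) / rowSums(k))
        val z = weights.sum
        for (k <- 0 until numTopics) stats(k)(w) += cnt * weights(k) / z
      }
      // lambdaHat: what lambda would be if the whole corpus looked like this batch.
      val scale = corpusSize / batch.size
      val rho = math.pow(tau0 + t + 1, -kappa)  // decaying step size
      for (k <- 0 until numTopics; w <- 0 until vocabSize) {
        val lambdaHat = eta + scale * stats(k)(w)
        lambda(k)(w) = (1 - rho) * lambda(k)(w) + rho * lambdaHat
      }
    }
  }
}
{code}

The essential point is the final loop: each mini-batch yields a noisy full-corpus estimate lambdaHat that is blended into lambda with step size rho = (tau0 + t)^(-kappa), which decays over time. That is why the corpus never needs to be stored or revisited.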
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302952#comment-14302952 ] yuhao yang edited comment on SPARK-1405 at 2/3/15 8:35 AM: --- Hi everyone, I'm sharing an implementation of [Online LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at https://github.com/hhbyyh/OnlineLDA_Spark, and hope it can be helpful for anyone interested. The work is based on the research of [Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online nature, the algorithm 1. scans the corpus (doc sets) only once. Thus it need not locally store or collect the documents and can be handily applied to streaming document collections. 2. breaks the massive corpus into mini-batches and takes one batch at a time, which reduces memory and time consumption. 3. approximates the posterior as well as traditional approaches do (generates comparable or better results). In demo runs, the current implementation (with many details still to be improved) 1. processed 8 million short articles (Stack Overflow post titles, avg length 9, K=10) in 15 minutes. 2. processed the entire English wiki dump (5876K documents, avg length ~900 words per doc, 30G on disk, K=10) in 2 hours and 17 minutes on a 4-node cluster (20G memory; could be much less). Trials and suggestions are most welcome! was (Author: yuhaoyan): Hi everyone, I'm sharing an implementation of [Online LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at https://github.com/hhbyyh/OnlineLDA_Spark, and hope it can be helpful for anyone interested. The work is based on the research of [Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online nature, the algorithm 1. scans the corpus (doc sets) only once. Thus it {quote}need not locally store or collect the documents and can be handily applied to streaming document collections.{quote} 2. breaks the massive corpus into mini-batches and takes one batch at a time, which reduces memory and time consumption. 3. approximates the posterior as well as traditional approaches do (generates comparable or better results). In demo runs, the current implementation (with many details still to be improved) 1. processed 8 million short articles (Stack Overflow post titles, avg length 9, K=10) in 15 minutes. 2. processed the entire English wiki dump (5876K documents, avg length ~900 words per doc, 30G on disk, K=10) in 2 hours and 17 minutes on a 4-node cluster (20G memory; could be much less). Trials and suggestions are most welcome! parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Joseph K. Bradley Priority: Critical Labels: features Fix For: 1.3.0 Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling.
In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (already solved), a word segmentation (imported from Lucene), and a Gibbs sampling core.
[jira] [Commented] (SPARK-5566) Tokenizer for mllib package
[ https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308733#comment-14308733 ] yuhao yang commented on SPARK-5566: --- I mean only the underlying implementation. Tokenizer for mllib package --- Key: SPARK-5566 URL: https://issues.apache.org/jira/browse/SPARK-5566 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley There exist tokenizer classes in the spark.ml.feature package and in the LDAExample in the spark.examples.mllib package. The Tokenizer in the LDAExample is more advanced and should be made into a full-fledged public class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should become a wrapper around the new Tokenizer.
[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115 ] yuhao yang edited comment on SPARK-5563 at 2/4/15 2:22 PM: --- Thanks Joseph for helping create the jira. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here; the current implementation is shared at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon. Thanks. was (Author: yuhaoyan): Thanks Joseph for helping create the jira. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here; the current implementation is shared at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon. Thanks LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115 ] yuhao yang commented on SPARK-5563: --- Thanks Joseph for helping create the jira. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here; the current implementation is shared at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon. Thanks LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115 ] yuhao yang edited comment on SPARK-5563 at 2/4/15 2:23 PM: --- Thanks Joseph for helping create the jira. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here; the current implementation is shared at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon (ETA tomorrow). Thanks. was (Author: yuhaoyan): Thanks Joseph for helping create the jira. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here; the current implementation is shared at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon. Thanks. LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305199#comment-14305199 ] yuhao yang commented on SPARK-5563: --- BTW, batch versions of online variational inference are useful when processing small data sets (especially toy data in unit tests). LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Commented] (SPARK-5566) Tokenizer for mllib package
[ https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305172#comment-14305172 ] yuhao yang commented on SPARK-5566: --- Actually I believe a lot of the current code, like Word2Vec and HashingTF, shares a similar data flow, and it's best if we can take the common requirements into consideration. Tokenizer for mllib package --- Key: SPARK-5566 URL: https://issues.apache.org/jira/browse/SPARK-5566 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley There exist tokenizer classes in the spark.ml.feature package and in the LDAExample in the spark.examples.mllib package. The Tokenizer in the LDAExample is more advanced and should be made into a full-fledged public class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should become a wrapper around the new Tokenizer.
[jira] [Closed] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5282. - Fixed. RowMatrix easily gets int overflow in the memory size warning - Key: SPARK-5282 URL: https://issues.apache.org/jira/browse/SPARK-5282 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Assignee: yuhao yang Priority: Trivial Fix For: 1.3.0, 1.2.1 Original Estimate: 1h Remaining Estimate: 1h The warning in RowMatrix easily gets int overflow when the number of columns is larger than 16385. Minor issue.
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280025#comment-14280025 ] yuhao yang commented on SPARK-5186: --- I just updated the PR with a hashCode fix. Please review at will. Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size - Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Original Estimate: 0.25h Remaining Estimate: 0.25h The implementations of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse.
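One possible shape of such an equals fix, sketched under the assumption that the index arrays are sorted (this is not the merged Spark code): compare only the stored entries with a merge-style scan, skipping explicitly stored zeros, so two truly sparse vectors compare in O(nnz) time regardless of their nominal size. A matching hashCode would likewise hash only the non-zero entries, so that equal vectors still hash equally.

{code}
// Sketch: structural equality of two sparse vectors without densifying either.
def sparseEquals(size1: Int, indices1: Array[Int], values1: Array[Double],
                 size2: Int, indices2: Array[Int], values2: Array[Double]): Boolean = {
  if (size1 != size2) return false
  var i = 0
  var j = 0
  while (i < indices1.length && j < indices2.length) {
    if (values1(i) == 0.0) i += 1                      // stored zero: contributes nothing
    else if (values2(j) == 0.0) j += 1
    else if (indices1(i) != indices2(j)) return false  // non-zero entry with no match
    else if (values1(i) != values2(j)) return false
    else { i += 1; j += 1 }
  }
  // Any remaining stored entries must all be zeros.
  while (i < indices1.length) { if (values1(i) != 0.0) return false; i += 1 }
  while (j < indices2.length) { if (values2(j) != 0.0) return false; j += 1 }
  true
}
{code}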
[jira] [Closed] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5234. - fixed examples for ml don't have sparkContext.stop Key: SPARK-5234 URL: https://issues.apache.org/jira/browse/SPARK-5234 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Environment: all Reporter: yuhao yang Assignee: yuhao yang Priority: Trivial Fix For: 1.3.0, 1.2.1 Original Estimate: 1h Remaining Estimate: 1h Not sure why sc.stop() is not in the org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, SimpleTextClassificationPipeline}. I can prepare a PR if it's not intentional to omit the call to stop.
[jira] [Created] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
yuhao yang created SPARK-5282: - Summary: RowMatrix easily gets int overflow in the memory size warning Key: SPARK-5282 URL: https://issues.apache.org/jira/browse/SPARK-5282 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Trivial The warning in RowMatrix easily gets int overflow when the number of columns is larger than 16385. Minor issue.
[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280159#comment-14280159 ] yuhao yang commented on SPARK-5282: --- Typical wrong message: Row matrix: 17000 columns will require at least -1982967296 bytes of memory! PR on the way. RowMatrix easily gets int overflow in the memory size warning - Key: SPARK-5282 URL: https://issues.apache.org/jira/browse/SPARK-5282 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Trivial Original Estimate: 1h Remaining Estimate: 1h The warning in RowMatrix easily gets int overflow when the number of columns is larger than 16385. Minor issue.
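The arithmetic behind that negative number: the warning estimates roughly 8 bytes per entry of an n x n Gramian in Int arithmetic, and 8 * 17000 * 17000 = 2312000000 wraps around 2^32 to exactly -1982967296. A minimal reproduction and the Long-arithmetic fix (illustrative code, not the actual RowMatrix source):

{code}
val cols = 17000
val wrong: Int  = cols * cols * 8           // Int arithmetic wraps to -1982967296
val right: Long = cols.toLong * cols * 8L   // 2312000000 bytes, as intended
println(s"wrong = $wrong, right = $right")
{code}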
[jira] [Closed] (SPARK-5717) add sc.stop to LDA examples
[ https://issues.apache.org/jira/browse/SPARK-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5717. - Merged. Thanks add sc.stop to LDA examples --- Key: SPARK-5717 URL: https://issues.apache.org/jira/browse/SPARK-5717 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: yuhao yang Assignee: yuhao yang Priority: Trivial Fix For: 1.3.0 Original Estimate: 1h Remaining Estimate: 1h Trivial. Add sc.stop and reorganize imports in LDAExample and JavaLDAExample
[jira] [Closed] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths
[ https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5384. - Fixed. Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths -- Key: SPARK-5384 URL: https://issues.apache.org/jira/browse/SPARK-5384 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.1 Environment: centos, others should be similar Reporter: yuhao yang Assignee: yuhao yang Priority: Critical Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h For two vectors of different lengths, Vectors.sqdist returns different results when the vectors are represented as sparse and dense respectively. Sample: val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) val s2 = new SparseVector(1, Array(0), Array(9.0)) val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) val d2 = new DenseVector(Array(9.0)) println(s1 == d1 && s2 == d2) println(Vectors.sqdist(s1, s2)) println(Vectors.sqdist(d1, d2)) result: true 93.0 64.0 More precisely, for the extra part, Vectors.sqdist includes it for sparse vectors and excludes it for dense vectors. I'll send a PR and we can have a more detailed discussion there.
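One consistent semantics the report implies is to treat the shorter vector as zero-padded, which is what the sparse path already does. A minimal sketch under that assumption, over dense arrays (the actual merged fix may instead simply require equal sizes; sqdistPadded is an invented name):

{code}
// Squared Euclidean distance where the shorter vector's missing tail counts as zeros.
def sqdistPadded(v1: Array[Double], v2: Array[Double]): Double = {
  val n = math.max(v1.length, v2.length)
  var sum = 0.0
  var i = 0
  while (i < n) {
    val a = if (i < v1.length) v1(i) else 0.0
    val b = if (i < v2.length) v2(i) else 0.0
    val d = a - b
    sum += d * d
    i += 1
  }
  sum
}

// sqdistPadded(Array(1.0, 2.0, 3.0, 4.0), Array(9.0))
//   == (1-9)^2 + 2^2 + 3^2 + 4^2 == 93.0, matching the sparse result above.
{code}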
[jira] [Created] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
yuhao yang created SPARK-5406: - Summary: LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound Key: SPARK-5406 URL: https://issues.apache.org/jira/browse/SPARK-5406 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation: val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K.
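The quoted bound can be checked directly in Long arithmetic; this tiny illustrative snippet (the function name is not from Spark or breeze) confirms that n = 17514 is the largest column count satisfying 7n^2 + 4n < Int.MaxValue:

{code}
// For m >= n the breeze work array holds about 7*n*n + 4*n Doubles,
// and its length must be a valid (positive) Int.
def workSizeFits(n: Long): Boolean = 7L * n * n + 4L * n < Int.MaxValue

println(workSizeFits(17514))  // true  -- largest column count that fits
println(workSizeFits(17515))  // false -- the work array length would overflow Int
{code}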
[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
[ https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5406: -- Description: In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation ( https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala ): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K. was: In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K. LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound - Key: SPARK-5406 URL: https://issues.apache.org/jira/browse/SPARK-5406 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation ( https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala ): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K.
[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
[ https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5406: -- Description: In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K. was: In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation: val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K. LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound - Key: SPARK-5406 URL: https://issues.apache.org/jira/browse/SPARK-5406 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K.
[jira] [Closed] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
[ https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5406. - Fixed and merged. Thanks LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound - Key: SPARK-5406 URL: https://issues.apache.org/jira/browse/SPARK-5406 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.3.0 Original Estimate: 2h Remaining Estimate: 2h In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. Yet breeze svd for dense matrices has a latent constraint. In its implementation ( https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala ): val workSize = ( 3 * scala.math.min(m, n) * scala.math.min(m, n) + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n) * scala.math.min(m, n) + 4 * scala.math.min(m, n)) ) val work = new Array[Double](workSize) As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue; thus, n < 17515. This jira is only the first step. If possible, I hope spark can handle matrix computation up to 80K * 80K.
[jira] [Commented] (SPARK-5510) How can I fix the spark-submit script and then running the program on cluster ?
[ https://issues.apache.org/jira/browse/SPARK-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300939#comment-14300939 ] yuhao yang commented on SPARK-5510: --- https://spark.apache.org/community.html check the mailing list section. How can I fix the spark-submit script and then running the program on cluster ? --- Key: SPARK-5510 URL: https://issues.apache.org/jira/browse/SPARK-5510 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.2 Reporter: hash-x Labels: Help!!, spark-submit Reference: My question is how I can fix the script so that I can submit the program to a Master from my laptop, not submit the program from a cluster. Submitting the program from Node 2 works for me, but from the laptop it does not! What can I do to fix this? Help!!! I have looked at the email below and I accept recommendation one - run spark-shell from a cluster node! But I want to solve the problem with recommendation 2, and I am confused. Hi Ken, This is unfortunately a limitation of spark-shell and the way it works in standalone mode. spark-shell sets an environment variable, SPARK_HOME, which tells Spark where to find its code installed on the cluster. This means that the path on your laptop must be the same as on the cluster, which is not the case. I recommend one of two things: 1) Either run spark-shell from a cluster node, where it will have the right path. (In general it’s also better for performance to have it close to the cluster) 2) Or, edit the spark-shell script and re-export SPARK_HOME right before it runs the Java command (ugly but will probably work).
[jira] [Created] (SPARK-5234) examples for ml don't have sparkContext.stop
yuhao yang created SPARK-5234: - Summary: examples for ml don't have sparkContext.stop Key: SPARK-5234 URL: https://issues.apache.org/jira/browse/SPARK-5234 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Environment: all Reporter: yuhao yang Priority: Trivial Fix For: 1.3.0 Not sure why sc.stop() is not in the org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, SimpleTextClassificationPipeline}. I can prepare a PR if it's not intentional to omit the call to stop.
[jira] [Created] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
yuhao yang created SPARK-5243: - Summary: Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I am preparing a PR for the case. And I would like to know your opinions about whether a fix is needed and the better fix options.
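A hypothetical shape of the requested guard; this is not actual Spark deploy code, and WorkerInfo and canEverSchedule are invented for illustration. The idea: in cluster deploy mode the driver itself occupies worker memory, so on a one-worker cluster the submission can be rejected up front when no worker could ever hold driver plus executor.

{code}
case class WorkerInfo(host: String, memoryMb: Int)

// True iff at least one worker could ever satisfy driver + executor memory together.
def canEverSchedule(driverMemMb: Int, executorMemMb: Int,
                    workers: Seq[WorkerInfo]): Boolean =
  workers.exists(_.memoryMb >= driverMemMb + executorMemMb)

val workers = Seq(WorkerInfo("node1", 4096))
if (!canEverSchedule(2048, 3072, workers)) {
  System.err.println("WARN: driver memory + executor memory exceeds the capacity " +
    "of every worker; in cluster mode this submission would wait forever.")
}
{code}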
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270869#comment-14270869 ] yuhao yang commented on SPARK-1405: --- Great design doc and solid proposal. I noticed the online variational EM mentioned in the doc, for which I have developed a Spark implementation. The work was based on an actual customer scenario and has exhibited remarkable speed and economical memory usage. The result is as good as the “batch” LDA, with handy support for streaming text thanks to its online nature. Right now we are turning it into a graph-based implementation and will perform further evaluation afterwards. The algorithm looks promising to us and can be helpful in many cases. For now I don’t find that online LDA will make the API design more complicated, as it’s more like incremental work. Just want to bring up the possibility in case anyone finds a conflict. Reference: [online LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] by [Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html] parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (already solved), a word segmentation (imported from Lucene), and a Gibbs sampling core.
[jira] [Created] (SPARK-5717) add sc.stop to LDA examples
yuhao yang created SPARK-5717: - Summary: add sc.stop to LDA examples Key: SPARK-5717 URL: https://issues.apache.org/jira/browse/SPARK-5717 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: yuhao yang Priority: Trivial Trivial. Add sc.stop and reorganize imports in LDAExample and JavaLDAExample
[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
[ https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5243: -- Description: Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I would like to know your opinions about whether a fix is needed and the better fix options. was: Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I am preparing a PR for the case. And I would like to know your opinions about whether a fix is needed and the better fix options. Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I would like to know your opinions about whether a fix is needed and the better fix options.
[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
[ https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5243: -- Description: Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I would like to know your opinions about whether a fix is needed (is this by design?) and the better fix options. was: Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I would like to know your opinions about whether a fix is needed and the better fix options. Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if calling spark-submit under the conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I would like to know your opinions about whether a fix is needed (is this by design?) and the better fix options.
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350 ] yuhao yang commented on SPARK-5563: --- Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350 ] yuhao yang edited comment on SPARK-5563 at 3/17/15 1:13 AM: Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation (C++) provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. was (Author: yuhaoyan): Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.
[jira] [Created] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
yuhao yang created SPARK-6374: - Summary: Add getter for GeneralizedLinearAlgorithm Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through the debugger.
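A hedged sketch of the requested accessors; the class and method names below are assumptions made for illustration, not the API that was eventually merged:

{code}
// Expose the internally tracked values instead of forcing users into a debugger.
class GeneralizedLinearAlgorithmLike {
  protected var numFeatures: Int = -1
  protected var addIntercept: Boolean = false

  /** Number of features used by the model (set during training). */
  def getNumFeatures: Int = numFeatures

  /** Whether the algorithm prepends an intercept term to the data. */
  def isAddIntercept: Boolean = addIntercept
}
{code}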
[jira] [Updated] (SPARK-6177) LDA should check partitions size of the input
[ https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6177: -- Description: Add a comment introducing coalesce to the LDA example, to avoid the possibly massive number of partitions from sc.textFile. sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance. was: sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance. LDA should check partitions size of the input - Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: Examples, MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor Original Estimate: 1h Remaining Estimate: 1h Add a comment introducing coalesce to the LDA example, to avoid the possibly massive number of partitions from sc.textFile. sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance.
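The note in code form, assuming an existing SparkContext named sc and an illustrative input path:

{code}
// sc.textFile yields at least one partition per input file, so a corpus of many
// small files produces thousands of tiny partitions; coalescing to a sane count
// before running LDA avoids the per-partition overhead.
val corpus = sc.textFile("/data/lda/docs/*")
  .coalesce(sc.defaultParallelism)
{code}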
[jira] [Updated] (SPARK-6177) Add note for
[ https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6177: -- Summary: Add note for (was: LDA should check partitions size of the input) Add note for - Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: Examples, MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor Original Estimate: 1h Remaining Estimate: 1h Add a comment introducing coalesce to the LDA example, to avoid the possibly massive number of partitions from sc.textFile. sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance.
[jira] [Updated] (SPARK-6177) Add note in LDA example to remind possible coalesce
[ https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6177: -- Summary: Add note in LDA example to remind possible coalesce (was: Add note for ) Add note in LDA example to remind possible coalesce Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: Examples, MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor Original Estimate: 1h Remaining Estimate: 1h Add a comment introducing coalesce to the LDA example, to avoid the possibly massive number of partitions from sc.textFile. sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance.
[jira] [Commented] (SPARK-6268) KMeans parameter getter methods
[ https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356125#comment-14356125 ] yuhao yang commented on SPARK-6268: --- Sure, I'll propose a PR very soon. Thanks! KMeans parameter getter methods --- Key: SPARK-6268 URL: https://issues.apache.org/jira/browse/SPARK-6268 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor KMeans has many setters for parameters. It should have matching getters.
[jira] [Comment Edited] (SPARK-6268) KMeans parameter getter methods
[ https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356106#comment-14356106 ] yuhao yang edited comment on SPARK-6268 at 3/11/15 2:14 AM: Hi Bradley, I hope this is not rude. Not sure if you want to do this yourself. If not, maybe I can help. Thanks. was (Author: yuhaoyan): Hi Bradley, I hope this is not rude. Not sure if you want to do this yourself. If not, maybe I can help. KMeans parameter getter methods --- Key: SPARK-6268 URL: https://issues.apache.org/jira/browse/SPARK-6268 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor KMeans has many setters for parameters. It should have matching getters.
[jira] [Closed] (SPARK-6177) Add note in LDA example to remind possible coalesce
[ https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-6177. - Fixed and merged, thanks. Add note in LDA example to remind possible coalesce Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: Examples, MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Assignee: yuhao yang Priority: Trivial Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h Add a comment introducing coalesce to the LDA example, to avoid the possibly massive number of partitions from sc.textFile. sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance.
[jira] [Created] (SPARK-6177) LDA should check partitions size of the input
yuhao yang created SPARK-6177: - Summary: LDA should check partitions size of the input Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang
[jira] [Updated] (SPARK-6177) LDA should check partitions size of the input
[ https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6177: -- Description: sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance. LDA should check partitions size of the input - Key: SPARK-6177 URL: https://issues.apache.org/jira/browse/SPARK-6177 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Original Estimate: 1h Remaining Estimate: 1h sc.textFile creates an RDD with one partition for each file, and the possibly massive number of partitions degrades LDA performance.
[jira] [Created] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths
yuhao yang created SPARK-5384: - Summary: Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths Key: SPARK-5384 URL: https://issues.apache.org/jira/browse/SPARK-5384 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.1 Environment: centos, others should be similar Reporter: yuhao yang Priority: Critical Fix For: 1.2.1 For two vectors of different lengths, Vectors.sqdist returns different results when the vectors are represented as sparse and dense respectively. Sample: val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0)) val s2 = new SparseVector(1, Array(0), Array(9.0)) val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0)) val d2 = new DenseVector(Array(9.0)) println(s1 == d1 && s2 == d2) println(Vectors.sqdist(s1, s2)) println(Vectors.sqdist(d1, d2)) result: true 93.0 64.0 More precisely, for the extra part, Vectors.sqdist includes it for sparse vectors and excludes it for dense vectors. I'll send a PR and we can have a more detailed discussion there.
[jira] [Created] (SPARK-6693) add to string with max lines and width for matrix
yuhao yang created SPARK-6693: - Summary: add toString with max lines and width for matrix Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor It's kind of annoying when debugging to find you cannot print out the matrix as you want. The original toString of Matrix only prints like the following: 0.17810102596909183 0.5616906241468385 ... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176 ... A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, logging and saving matrices to files.
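A minimal sketch of what such a bounded toString could look like for a row-major matrix; this is illustrative only (the function name and row-array representation are assumptions), not the implementation that was merged:

{code}
// Render at most maxLines rows, truncating each rendered row at maxWidth characters.
def matrixToString(values: Array[Array[Double]], maxLines: Int, maxWidth: Int): String = {
  val shown = values.take(maxLines).map { row =>
    val line = row.mkString("  ")
    if (line.length > maxWidth) line.take(maxWidth) + " ..." else line
  }
  val suffix = if (values.length > maxLines) s"\n... (${values.length} rows total)" else ""
  shown.mkString("\n") + suffix
}
{code}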
[jira] [Updated] (SPARK-6693) add toString with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6693: -- Summary: add toString with max lines and width for matrix (was: add to string with max lines and width for matrix) add toString with max lines and width for matrix Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging to find you cannot print out the matrix as you want. The original toString of Matrix only prints like the following:
0.17810102596909183 0.5616906241468385 ... (100 total)
0.9692861997823815 0.015558159784155756 ...
0.8513015122819192 0.031523763918528847 ...
0.5396875653953941 0.3267864552779176 ...
A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, logging and saving matrices to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
[ https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-6374. - Fix merged. Thanks. Add getter for GeneralizedLinearAlgorithm - Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through a debugger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6693) add toString with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-6693. - Fix merged. Thanks. add toString with max lines and width for matrix Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging to find you cannot print out the matrix as you want. The original toString of Matrix only prints like the following:
0.17810102596909183 0.5616906241468385 ... (100 total)
0.9692861997823815 0.015558159784155756 ...
0.8513015122819192 0.031523763918528847 ...
0.5396875653953941 0.3267864552779176 ...
A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, logging and saving matrices to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508907#comment-14508907 ] yuhao yang commented on SPARK-7090: --- Oops, I thought there was something wrong... I'll close the other. Thanks Introduce LDAOptimizer to LDA to further improve extensibility -- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
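The design reads naturally as the following usage pattern (real Spark 1.4 MLlib names; the parameter values and the corpus input are assumed for illustration):
{code}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// corpus: RDD[(Long, Vector)] of (docId, termCounts), assumed already prepared.
// Configure a concrete optimizer, then hand it to the common LDA entry point.
val optimizer = new OnlineLDAOptimizer().setMiniBatchFraction(0.05)
val ldaModel = new LDA()
  .setK(10)
  .setOptimizer(optimizer)
  .run(corpus)
{code}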
[jira] [Closed] (SPARK-7089) Introduce LDAOptimizer to LDA to improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-7089. - Resolution: Duplicate Sorry for the duplication. Introduce LDAOptimizer to LDA to improve extensibility -- Key: SPARK-7089 URL: https://issues.apache.org/jira/browse/SPARK-7089 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Priority: Minor Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7089) Introduce LDAOptimizer to LDA to improve extensibility
yuhao yang created SPARK-7089: - Summary: Introduce LDAOptimizer to LDA to improve extensibility Key: SPARK-7089 URL: https://issues.apache.org/jira/browse/SPARK-7089 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7090) Introduce LDAOptimizer to LDA to improve extensibility
yuhao yang created SPARK-7090: - Summary: Introduce LDAOptimizer to LDA to improve extensibility Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7090: -- Summary: Introduce LDAOptimizer to LDA to further improve extensibility (was: Introduce LDAOptimizer to LDA to improve extensibility ) Introduce LDAOptimizer to LDA to further improve extensibility -- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7090) Introduce LDAOptimizer to LDA to improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7090: -- Summary: Introduce LDAOptimizer to LDA to improve extensibility (was: Introduce LDAOptimizer to LDA to improve extensibility) Introduce LDAOptimizer to LDA to improve extensibility --- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang reopened SPARK-7090: --- Reopening this since 7089 was already closed. Introduce LDAOptimizer to LDA to further improve extensibility -- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508907#comment-14508907 ] yuhao yang edited comment on SPARK-7090 at 4/23/15 12:00 PM: - Oops, I thought there was something wrong... I'll close the other. Thanks was (Author: yuhaoyan): Hoops, I thought there was something wrong... I'll close the other. Thanks Introduce LDAOptimizer to LDA to further improve extensibility -- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7368) add QR decomposition for RowMatrix
yuhao yang created SPARK-7368: - Summary: add QR decomposition for RowMatrix Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to: Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
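A hedged sketch of the paper's tall-and-skinny QR (TSQR) reduction: QR-factorize each partition's row block locally, then repeatedly stack and re-factorize the small R factors. Q recovery and numerical details are omitted; blocks is an assumed RDD of per-partition row blocks.
{code}
import breeze.linalg.{qr, DenseMatrix => BDM}
import org.apache.spark.rdd.RDD

// Returns the R factor of the stacked blocks; each intermediate R is tiny
// (numCols x numCols), so the reduction stays cheap even for huge row counts.
def tsqrR(blocks: RDD[BDM[Double]]): BDM[Double] =
  blocks
    .map(block => qr.reduced(block).r) // local QR per block
    .treeReduce((r1, r2) => qr.reduced(BDM.vertcat(r1, r2)).r)
{code}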
[jira] [Commented] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529742#comment-14529742 ] yuhao yang commented on SPARK-7368: --- Oops, I was not aware of the previous effort. Thanks Joseph and Zongheng. I'll try with the AMPLab version and send an update. add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to: Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7475) adjust ldaExample for online LDA
yuhao yang created SPARK-7475: - Summary: adjust ldaExample for online LDA Key: SPARK-7475 URL: https://issues.apache.org/jira/browse/SPARK-7475 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Minor Add a new argument to specify the algorithm applied to LDA, to exhibit the basic usage of LDAOptimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7514) Add MinMaxScaler to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537602#comment-14537602 ] yuhao yang commented on SPARK-7514: --- The class name has always been MinMaxScaler in the code, yet I named the jira wrongly... For the parameters, currently the model looks like:
class MinMaxScalerModel(
    val min: Vector,
    val max: Vector,
    var newBase: Double,
    var scale: Double) extends VectorTransformer
I have used min and max to store the model statistics. In some articles, the range bounds are named newMin / newMax (I think that can be confusing). I ran out of variable names here... setCenterScale looks good. Add MinMaxScaler to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7514) Add MinMaxScaler to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537651#comment-14537651 ] yuhao yang commented on SPARK-7514: --- Thanks Joseph, just one concern about using center, as it will change the core function from Normalized( x ) = (x - min) / (max - min) * scale + newBase to Normalized( x ) = ((x - min) / (max - min) - 0.5) * scale + center, which seems not as straightforward. Sure, we can discuss it further over code. Add MinMaxScaler to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7514) Add MinMaxScaler to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537651#comment-14537651 ] yuhao yang edited comment on SPARK-7514 at 5/11/15 6:41 AM: Thanks Joseph, just one concern about using center, as it will change the core function from Normalized( x ) = (x - min) / (max - min) * scale + newBase to Normalized( x ) = ((x - min) / (max - min) - 0.5) * scale + center, which seems not as straightforward. Sure, we can discuss it further over code. was (Author: yuhaoyan): Thanks Joseph, just one concern for using center as it will change the core function from Normalized( x ) = (x - min) / (max - min) * scale + newBase to Normalized( x ) = ((x - min) / (max - min) - 0.5 )* scale + center which seems be to not as straightforward. Sure we can further discuss it over code. Add MinMaxScaler to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7514) Add MinMaxScaler to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7514: -- Summary: Add MinMaxScaler to feature transformation (was: Add MinMaxNormalizer to feature transformation) Add MinMaxScaler to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7514) Add MinMaxNormalizer to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537527#comment-14537527 ] yuhao yang commented on SPARK-7514: --- Hi Joseph, that's a good idea. I did a simple Google search: Weka: class Normalize takes a scaling factor and a translation (same concepts as scale and newBase): http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Normalize.html sklearn.preprocessing.MinMaxScaler takes min and scale, yet in array format: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html Some implement basic MinMax and take no extra parameters: http://docs.pervasive.com/products/DataRush/DF63/javadoc/com/pervasive/datarush/analytics/functions/StatsFunctions.html http://help.sap.com/saphelp_hanaplatform/helpdata/en/e3/f29fafd4ac43339a1a39407884e545/content.htm?frameset=/en/e6/5c78507a424be58e52877496e2b516/frameset.htmcurrent_toc=/en/32/731a7719f14e488b1f4ab0afae995b/plain.htmnode_id=52 Add MinMaxNormalizer to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7496) Update Programming guide with Online LDA
[ https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537482#comment-14537482 ] yuhao yang commented on SPARK-7496: --- Thanks Joseph. PR sent. Update Programming guide with Online LDA Key: SPARK-7496 URL: https://issues.apache.org/jira/browse/SPARK-7496 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Joseph K. Bradley Priority: Minor Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7514) Add MinMaxNormalizer to feature transformation
yuhao yang created SPARK-7514: - Summary: Add MinMaxNormalizer to feature transformation Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Add a new scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized(x) = (x - min) / (max - min) * scale + newBase, where newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
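The core function transcribes directly to code. A minimal sketch, assuming the per-feature min and max have already been computed (names follow the issue text):
{code}
// Map a value from [min, max] to [newBase, newBase + scale].
// Per the issue, the raw value 0.5 is used when max == min.
def rescale(x: Double, min: Double, max: Double, scale: Double, newBase: Double): Double = {
  val raw = if (max == min) 0.5 else (x - min) / (max - min)
  raw * scale + newBase
}
{code}
For example, rescale(5.0, 0.0, 10.0, 0.8, 0.1) maps the midpoint of [0, 10] to 0.5, the midpoint of the target range [0.1, 0.9].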
[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7514: -- Description: Add a new scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized(x) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm was: Add a new scaling method to feature component, which is commonly known as min-max normalization or Rescaling. Core function is, Normalized(x) = (x - min) / (max - min) * scale + newBase where newBase the new minimum number for the feature, and scale controls the range after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application. for case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm Add MinMaxNormalizer to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a new scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized(x) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7514: -- Description: Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm was: Add a new scaling method to feature component, which is commonly known as min-max normalization or Rescaling. Core function is, Normalized( x ) = (x - min) / (max - min) * scale + newBase where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application. for case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm Add MinMaxNormalizer to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a popular scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation
[ https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7514: -- Description: Add a new scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm was: Add a new scaling method to feature component, which is commonly known as min-max normalization or Rescaling. Core function is, Normalized(x) = (x - min) / (max - min) * scale + newBase where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application. for case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm Add MinMaxNormalizer to feature transformation -- Key: SPARK-7514 URL: https://issues.apache.org/jira/browse/SPARK-7514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Add a new scaling method to the feature component, which is commonly known as min-max normalization or Rescaling. The core function is: Normalized( x ) = (x - min) / (max - min) * scale + newBase, where newBase and scale are parameters of the VectorTransformer. newBase is the new minimum number for the feature, and scale controls the range after transformation. This is a little more complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically, like [0.1, 0.9] in some NN applications. For the case that max == min, 0.5 is used as the raw value. reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7496) Update Programming guide with Online LDA
[ https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537114#comment-14537114 ] yuhao yang commented on SPARK-7496: --- Hi Joseph, just something I got for your reference: LDA takes in a collection of documents as vectors of word counts. It supports different inference algorithms via the setOptimizer function. EMLDAOptimizer learns clustering using expectation-maximization on the likelihood function, while OnlineLDAOptimizer uses iterative mini-batch sampling for online variational inference. After fitting on the documents, LDA provides: Update Programming guide with Online LDA Key: SPARK-7496 URL: https://issues.apache.org/jira/browse/SPARK-7496 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Joseph K. Bradley Priority: Minor Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
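A sketch matching the guide text above (the string form of setOptimizer is the real 1.4 API; k and corpus are illustrative assumptions):
{code}
import org.apache.spark.mllib.clustering.LDA

// corpus: RDD[(Long, Vector)] of (docId, termCounts), assumed prepared.
// "em" selects EMLDAOptimizer; "online" selects OnlineLDAOptimizer.
val ldaModel = new LDA().setK(20).setOptimizer("online").run(corpus)
val topics = ldaModel.topicsMatrix // vocabSize x k matrix of topic-term weights
{code}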
[jira] [Closed] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility
[ https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-7090. - Closing the jira as the code is merged. Thanks for the careful review and the important fix. Introduce LDAOptimizer to LDA to further improve extensibility -- Key: SPARK-7090 URL: https://issues.apache.org/jira/browse/SPARK-7090 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: yuhao yang Assignee: yuhao yang Fix For: 1.4.0 Original Estimate: 72h Remaining Estimate: 72h LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly. Basically, class LDA would be the common entry point for LDA computation, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537205#comment-14537205 ] yuhao yang edited comment on SPARK-7368 at 5/10/15 2:54 PM: Hi Zongheng, since the Amplab version is built upon a different RowMatrix implementation, I'm not sure it's appropriate to make a direct comparison. I haven't got the time to review the differences carefully. If possible, could you please share more of the information that has been collected, like benchmarks or capabilities? In the meantime, I'll do the same for my PR. And what's better is that there's a plan to migrate the Amplab version to Spark. For anyone with interest, your suggestions and trials will be most welcome. Thanks. was (Author: yuhaoyan): Hi Zongheng, since the Amplab version is built upon a different RowMatrix implementation. I'm not sure if it's appropriate to make a direct comparison. I haven't got the time to review the difference carefully. If possible, can you please share more information that has been collected, like some benchmark or capability. In the meantime, I'll do it for my PR also. And what's better is that there have been a plan to migrate the Amplab version to Spark. Let me know if you have any suggestion about how shall we proceed. Thanks. add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to: Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537205#comment-14537205 ] yuhao yang commented on SPARK-7368: --- Hi Zongheng, since the Amplab version is built upon a different RowMatrix implementation, I'm not sure it's appropriate to make a direct comparison. I haven't got the time to review the differences carefully. If possible, could you please share more of the information that has been collected, like benchmarks or capabilities? In the meantime, I'll do the same for my PR. And what's better is that there's a plan to migrate the Amplab version to Spark. Let me know if you have any suggestions about how we should proceed. Thanks. add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to: Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7455) Perf test for LDA (EM/online)
[ https://issues.apache.org/jira/browse/SPARK-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543065#comment-14543065 ] yuhao yang commented on SPARK-7455: --- I'll start to work on this. Any help or suggestion will be welcome. Perf test for LDA (EM/online) - Key: SPARK-7455 URL: https://issues.apache.org/jira/browse/SPARK-7455 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7496) User guide update for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-7496. - Doc updated. Thanks for review. User guide update for Online LDA Key: SPARK-7496 URL: https://issues.apache.org/jira/browse/SPARK-7496 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Joseph K. Bradley Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7455) Perf test for LDA (EM/online)
[ https://issues.apache.org/jira/browse/SPARK-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551624#comment-14551624 ] yuhao yang commented on SPARK-7455: --- work in progress https://github.com/databricks/spark-perf/pull/70 Perf test for LDA (EM/online) - Key: SPARK-7455 URL: https://issues.apache.org/jira/browse/SPARK-7455 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: yuhao yang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576358#comment-14576358 ] yuhao yang commented on SPARK-5567: --- I guess the major consideration is proper code reuse. I can provide a quick implementation based on the inference from OnlineLDAOptimizer (simply the gamma computation part), yet I'm not sure if it's appropriate to have LocalLDAModel refer to the methods of OnlineLDAOptimizer. Possible solutions include: 1) having a separate OnlineLDAModel, which can invoke the inference of OnlineLDA; 2) moving the inference method to object LDAOptimizer. Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578595#comment-14578595 ] yuhao yang commented on SPARK-5567: --- Hi Joseph, just to be clear: if we're using the MAP prediction you mentioned, does it require fold-in Gibbs sampling (and convergence) in the prediction process, or just a straightforward summation? I checked the implementation in https://github.com/mimno/Mallet/blob/master/src/cc/mallet/topics/TopicInferencer.java#L81. Is that aligned with your idea? Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8169) Add StopWordsRemover as a transformer
[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578444#comment-14578444 ] yuhao yang commented on SPARK-8169: --- This looks useful. I'd like to give it a try if no one has started on this. And I think there could be more transformers for text pre-processing, like the text vectorization in the LDA example and a low-frequency filter. Some rough ideas: the default stop words will probably contain English only, yet the StopWordsRemover should support ASCII. Case sensitivity will be a parameter. Let me know if I'm missing some requirement. Add StopWordsRemover as a transformer - Key: SPARK-8169 URL: https://issues.apache.org/jira/browse/SPARK-8169 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default. {code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
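A toy sketch of the requirements listed in the comment above, assuming a default English list and a caseSensitive switch (names are illustrative, not the eventual spark.ml API):
{code}
// Minimal stop-word removal over tokenized text; the default list is a
// tiny illustrative subset of a real English stop-word list.
val defaultStopWords = Set("a", "an", "the", "of", "and", "to")

def removeStopWords(tokens: Seq[String],
                    stopWords: Set[String] = defaultStopWords,
                    caseSensitive: Boolean = false): Seq[String] = {
  val sw = if (caseSensitive) stopWords else stopWords.map(_.toLowerCase)
  tokens.filterNot(t => sw.contains(if (caseSensitive) t else t.toLowerCase))
}
{code}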
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570527#comment-14570527 ] yuhao yang commented on SPARK-7541: --- I find no more issues. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
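For reference, a typical round-trip from this checklist looks like the following (real MLlib 1.3+ API; sc, trainingData, and testVector are assumed to exist):
{code}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

// Train, save, reload, and verify the reloaded model behaves identically.
val model = NaiveBayes.train(trainingData) // trainingData: RDD[LabeledPoint]
model.save(sc, "target/tmp/nbModel")
val sameModel = NaiveBayesModel.load(sc, "target/tmp/nbModel")
assert(model.predict(testVector) == sameModel.predict(testVector))
{code}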
[jira] [Updated] (SPARK-7983) Add require for one-based indices in loadLibSVMFile
[ https://issues.apache.org/jira/browse/SPARK-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7983: -- Priority: Minor (was: Trivial) Add require for one-based indices in loadLibSVMFile --- Key: SPARK-7983 URL: https://issues.apache.org/jira/browse/SPARK-7983 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Minor Original Estimate: 1h Remaining Estimate: 1h Add require for one-based indices in loadLibSVMFile. Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to weird results for many algorithms (like GBDT). Add a quick check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
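A hedged sketch of the proposed guard in loadLibSVMFile's parser: LIBSVM indices are one-based, so fail fast on zero-based input instead of silently producing wrong results downstream. The helper name is hypothetical.
{code}
// Parse one LIBSVM feature index, rejecting anything that is not one-based.
def parseOneBasedIndex(indexStr: String): Int = {
  val index = indexStr.toInt
  require(index > 0, s"Index $index found: loadLibSVMFile requires one-based indices.")
  index - 1 // MLlib stores indices zero-based internally
}
{code}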
[jira] [Created] (SPARK-8531) Update ML user guide for MinMaxScaler
yuhao yang created SPARK-8531: - Summary: Update ML user guide for MinMaxScaler Key: SPARK-8531 URL: https://issues.apache.org/jira/browse/SPARK-8531 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: yuhao yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8529) Set metadata for MinMaxScaler
yuhao yang created SPARK-8529: - Summary: Set metadata for MinMaxScaler Key: SPARK-8529 URL: https://issues.apache.org/jira/browse/SPARK-8529 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor Adding this as a reminder to complete the output metadata for the MinMaxScaler transformer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8530) Add Python API for MinMaxScaler
yuhao yang created SPARK-8530: - Summary: Add Python API for MinMaxScaler Key: SPARK-8530 URL: https://issues.apache.org/jira/browse/SPARK-8530 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: yuhao yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8547) xgboost exploration
[ https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597012#comment-14597012 ] yuhao yang commented on SPARK-8547: --- This is definitely useful with many potential users. xgboost exploration --- Key: SPARK-8547 URL: https://issues.apache.org/jira/browse/SPARK-8547 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Joseph K. Bradley There has been quite a bit of excitement around xgboost: [https://github.com/dmlc/xgboost] It improves the parallelism of boosting by mixing boosting and bagging (where bagging makes the algorithm more parallel). It would be worth exploring implementing this within MLlib (probably as a new algorithm). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process
yuhao yang created SPARK-8555: - Summary: Online Variational Inference for the Hierarchical Dirichlet Process Key: SPARK-8555 URL: https://issues.apache.org/jira/browse/SPARK-8555 Project: Spark Issue Type: Bug Components: MLlib Reporter: yuhao yang Priority: Minor This task is created for exploration of the online HDP algorithm described in http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf. Major advantages of the algorithm: one pass over the corpus, streaming friendly, automatic K (topic number). Currently the scope is to support online HDP for topic modeling, i.e. probably an optimizer for LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process
[ https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-8555: -- Issue Type: New Feature (was: Bug) Online Variational Inference for the Hierarchical Dirichlet Process --- Key: SPARK-8555 URL: https://issues.apache.org/jira/browse/SPARK-8555 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Priority: Minor This task is created for exploration of the online HDP algorithm described in http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf. Major advantages of the algorithm: one pass over the corpus, streaming friendly, automatic K (topic number). Currently the scope is to support online HDP for topic modeling, i.e. probably an optimizer for LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process
[ https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597237#comment-14597237 ] yuhao yang commented on SPARK-8555: --- A basic implementation is available at https://github.com/hhbyyh/HDP; it still needs a lot of improvement and evaluation of performance and scalability. Online Variational Inference for the Hierarchical Dirichlet Process --- Key: SPARK-8555 URL: https://issues.apache.org/jira/browse/SPARK-8555 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Priority: Minor This task is created to explore the online HDP algorithm described in http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf. Major advantages of the algorithm: a single pass over the corpus, streaming friendliness, and automatic inference of K (the topic number). Currently the scope is to support online HDP for topic modeling, i.e. probably as an optimizer for LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8308) add missing save load for python doc example and tune down MatrixFactorization iterations
yuhao yang created SPARK-8308: - Summary: add missing save load for python doc example and tune down MatrixFactorization iterations Key: SPARK-8308 URL: https://issues.apache.org/jira/browse/SPARK-8308 Project: Spark Issue Type: Bug Components: MLlib Reporter: yuhao yang Priority: Minor 1. Add some missing save/load calls in the Python examples: LogisticRegression, LinearRegression, NaiveBayes. 2. Tune down the iteration count for MatrixFactorization, since the current number triggers a StackOverflowError. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
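A hedged illustration of the StackOverflowError point: ALS lineage grows with the iteration count, so a modest count (or checkpointing) keeps the example stable. The toy data and scratch path below are assumptions, not the doc example itself:

{code:scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy (user, product, rating) data; assumes an existing SparkContext `sc`.
val ratings = sc.parallelize(Seq(
  Rating(0, 0, 4.0), Rating(0, 1, 2.0),
  Rating(1, 1, 3.0), Rating(1, 2, 4.0)))

// A modest iteration count keeps the RDD lineage shallow; very large values
// can overflow the stack when the long dependency chain is serialized.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */)

// Alternatively, checkpointing truncates the lineage if more iterations are needed.
sc.setCheckpointDir("/tmp/als-checkpoint")  // assumed scratch path
{code}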
[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564263#comment-14564263 ] yuhao yang edited comment on SPARK-7541 at 5/29/15 6:40 AM:
||model||Scala UT||Python UT||changes||backwards compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Saveable in 1.4|New Saveable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Saveable in 1.4|New Saveable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.save|no public change|y|
The contents above have been checked and no obvious issues were detected. Joseph, do you think we should add save/load wherever available in the example documents?
was (Author: yuhaoyan):
||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Savable in 1.4|New Savable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Savable in 1.4|New Savable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Savable in 1.4|New Savable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Savable in 1.4|New Savable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Savable in 1.4|New Savable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.sav |
[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564296#comment-14564296 ] yuhao yang edited comment on SPARK-7541 at 5/29/15 7:14 AM: Oh, "checked" means I found no Python support for save/load for the model. I guess we can add them in 1.5. was (Author: yuhaoyan): Oh, "checked" means I found no Python support for save/load for the model. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564263#comment-14564263 ] yuhao yang commented on SPARK-7541: ---
||model||Scala UT||Python UT||changes||backwards compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Saveable in 1.4|New Saveable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Saveable in 1.4|New Saveable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.save|no public change|y|
The contents above have been checked and no obvious issues were detected. Joseph, do you think we should add save/load wherever available in the example documents? Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
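As a concrete illustration of the round-trip the table above records per model, a minimal sketch with KMeansModel (one of the models newly Saveable in 1.4); the toy data and scratch path are assumptions:

{code:scala}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Toy data; assumes an existing SparkContext `sc`.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(data, 2, 5)  // k = 2, maxIterations = 5

// Round-trip: save, reload, and compare cluster centers.
val path = "/tmp/kmeans-model"  // assumed scratch path
model.save(sc, path)
val loaded = KMeansModel.load(sc, path)
assert(model.clusterCenters.toSet == loaded.clusterCenters.toSet)
{code}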
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564296#comment-14564296 ] yuhao yang commented on SPARK-7541: --- Oh, "checked" means I found no Python support for save/load for the model. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7949) update document with some missing save/load
[ https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-7949: -- Description: As part of SPARK-7541, add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel was: add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel update document with some missing save/load --- Key: SPARK-7949 URL: https://issues.apache.org/jira/browse/SPARK-7949 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Minor As part of SPARK-7541, add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7949) update document with some missing save/load
yuhao yang created SPARK-7949: - Summary: update document with some missing save/load Key: SPARK-7949 URL: https://issues.apache.org/jira/browse/SPARK-7949 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Minor Add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7983) Add require for one-based indices in loadLibSVMFile
yuhao yang created SPARK-7983: - Summary: Add require for one-based indices in loadLibSVMFile Key: SPARK-7983 URL: https://issues.apache.org/jira/browse/SPARK-7983 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Priority: Trivial Add a require for one-based indices in loadLibSVMFile. Users frequently use zero-based indices in their LIBSVM files. Spark reports no warnings or errors during the subsequent computation, and this usually leads to weird results for many algorithms (such as GBDT). Add a quick check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
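A minimal sketch of what such a check could look like while parsing a single LIBSVM line; the function name and structure are illustrative, not the actual MLUtils internals:

{code:scala}
// Parses one LIBSVM line ("label index1:value1 index2:value2 ...") and
// rejects zero-based or unsorted indices instead of silently mis-reading them.
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val items = line.trim.split(' ').filter(_.nonEmpty)
  val label = items.head.toDouble
  var previous = 0  // last one-based index seen
  val (indices, values) = items.tail.map { item =>
    val Array(i, v) = item.split(':')
    val index = i.toInt
    require(index > previous,
      s"indices should be one-based and in ascending order; found index $index")
    previous = index
    (index - 1, v.toDouble)  // stored zero-based internally
  }.unzip
  (label, indices, values)
}

parseLibSVMLine("1.0 1:0.5 3:0.25")  // ok
// parseLibSVMLine("1.0 0:0.5")      // throws IllegalArgumentException
{code}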
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998 ] yuhao yang commented on SPARK-7541: --- Oh, I haven't checked through all the examples in the markdown documents (and I think it's necessary). The previous JIRA, SPARK-7949, just added some missing save/load. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998 ] yuhao yang edited comment on SPARK-7541 at 6/2/15 11:53 AM: Oh, thanks. But I haven't yet checked through all the examples in the markdown documents (and I think it's necessary). The previous JIRA, SPARK-7949, just added some missing save/load. was (Author: yuhaoyan): Oh, I haven't checked through all the examples in the markdown documents (and I think it's necessary). The previous JIRA, SPARK-7949, just added some missing save/load. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998 ] yuhao yang edited comment on SPARK-7541 at 6/2/15 11:56 AM: Oh, thanks. But I haven't yet checked through all the examples with save/load in the markdown documents (and I think it's necessary). The previous JIRA, SPARK-7949, just added some missing save/load. was (Author: yuhaoyan): Oh, thanks. But I haven't yet checked through all the examples in the markdown documents (and I think it's necessary). The previous JIRA, SPARK-7949, just added some missing save/load. Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8043) update NaiveBayes and SVM examples in doc
yuhao yang created SPARK-8043: - Summary: update NaiveBayes and SVM examples in doc Key: SPARK-8043 URL: https://issues.apache.org/jira/browse/SPARK-8043 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: yuhao yang Priority: Minor I found some issues while testing the save/load examples in the markdown documents, as part of the 1.4 QA plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7949) update document with some missing save/load
[ https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568339#comment-14568339 ] yuhao yang commented on SPARK-7949: --- Oh thanks, I thought we should close the JIRA when the code work is done. Shall I reopen it? update document with some missing save/load --- Key: SPARK-7949 URL: https://issues.apache.org/jira/browse/SPARK-7949 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 As part of SPARK-7541, add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7949) update document with some missing save/load
[ https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-7949. - update document with some missing save/load --- Key: SPARK-7949 URL: https://issues.apache.org/jira/browse/SPARK-7949 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 As part of SPARK-7541, add save/load to the examples for: KMeansModel, PowerIterationClusteringModel, Word2VecModel, IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8744) StringIndexerModel should have public constructor
[ https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610558#comment-14610558 ] yuhao yang commented on SPARK-8744: --- Just a reminder: there seems to be more work to do than simply changing the access modifiers, since passed-in labels will have a larger chance of triggering the unseen-label exception. Perhaps we should address the exception first. StringIndexerModel should have public constructor - Key: SPARK-8744 URL: https://issues.apache.org/jira/browse/SPARK-8744 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h It would be helpful to allow users to pass a pre-computed index to create an indexer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8744) StringIndexerModel should have public constructor
[ https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610558#comment-14610558 ] yuhao yang edited comment on SPARK-8744 at 7/1/15 4:10 PM: --- There seems to be more work than simply changing the access modifiers, since passed-in labels will have a larger chance of triggering the unseen-label exception. Perhaps we should address the exception first. was (Author: yuhaoyan): Just a reminder: there seems to be more work to do than simply changing the access modifiers, since passed-in labels will have a larger chance of triggering the unseen-label exception. Perhaps we should address the exception first. StringIndexerModel should have public constructor - Key: SPARK-8744 URL: https://issues.apache.org/jira/browse/SPARK-8744 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h It would be helpful to allow users to pass a pre-computed index to create an indexer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
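To make the unseen-label concern concrete, a toy sketch of the lookup a label-accepting constructor implies; the class name and error policy here are illustrative assumptions, not the proposed API:

{code:scala}
// Toy model built from user-supplied labels, mirroring what a public
// StringIndexerModel constructor would accept.
class SimpleStringIndexerModel(val labels: Array[String]) {
  private val labelToIndex: Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => (label, i.toDouble) }.toMap

  // With pre-computed labels, values unseen at construction time become much
  // more likely; surface a clear error (or map them to a reserved index).
  def indexOf(label: String): Double =
    labelToIndex.getOrElse(label,
      throw new NoSuchElementException(s"Unseen label: $label"))
}

val model = new SimpleStringIndexerModel(Array("a", "b", "c"))
assert(model.indexOf("b") == 1.0)
// model.indexOf("d")  // throws NoSuchElementException: Unseen label: d
{code}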
[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609703#comment-14609703 ] yuhao yang commented on SPARK-8703: --- Thanks Joseph. It's true that CountVectorizer and HashingTF share similar input and output, yet currently CountVectorizer does not actually inherit anything useful from HashingTF, and I kind of like the current clean separation among the feature transformers. I'm inclined to undo the extension. As for code reuse, given that HashingTF invokes the mllib version and that the implementation is quite straightforward, it may not be necessary to refactor for code reuse. [~viirya] and [~fliang], thanks for your opinions; I'd like to know your thoughts on this. Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. Similar to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can further add an estimator to extract the vocabulary from a corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
yuhao yang created SPARK-8703: - Summary: Add CountVectorizer as a ml transformer to convert document to words count vector Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Converts a text document to a sparse vector of token counts. I can further add an estimator to extract the vocabulary from a corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
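A minimal sketch of the core transform, assuming a precomputed vocabulary (the proposed estimator would learn it from the corpus); the function and variable names are illustrative:

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Maps each token to its vocabulary index and counts occurrences,
// producing the sparse token-count vector described above.
def countVector(vocabulary: Map[String, Int], doc: Seq[String]): Vector = {
  val counts = scala.collection.mutable.HashMap.empty[Int, Double]
  doc.foreach { term =>
    vocabulary.get(term).foreach { idx =>  // out-of-vocabulary terms are skipped
      counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
    }
  }
  Vectors.sparse(vocabulary.size, counts.toSeq)
}

val vocab = Map("spark" -> 0, "mllib" -> 1, "lda" -> 2)
countVector(vocab, Seq("spark", "mllib", "spark"))  // (3,[0,1],[2.0,1.0])
{code}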