[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365073#comment-14365073 ]

Matthew Willson commented on SPARK-5563:
----------------------------------------

No, thank you for working on this! Great stuff. Vowpal Wabbit is indeed scarily fast :)

If you're looking for more suggestions, one other quick win for LDA is the ability to seed topics with specific keywords by specifying non-symmetric Dirichlet priors for some of the topic-word distributions. This was very easy to add to gensim's online LDA implementation, for example.

> LDA with online variational inference
> -------------------------------------
>
>                 Key: SPARK-5563
>                 URL: https://issues.apache.org/jira/browse/SPARK-5563
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>
> Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. It will also be important for supporting streaming versions of LDA.
> The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking into the existing mllib.optimization frameworks. It will also require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
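The topic-seeding idea above amounts to replacing the usual symmetric Dirichlet prior on the topic-word distributions with a per-topic prior that puts extra mass on chosen keywords. A minimal sketch of how such a prior matrix could be built (the vocabulary, seed words, and helper name are hypothetical, purely for illustration; this is not MLlib or gensim API):

```python
import numpy as np

# Hypothetical toy vocabulary -- illustrative only.
vocab = ["game", "team", "score", "market", "stock", "price"]
num_topics = 2

def seeded_eta(vocab, num_topics, seed_words, base=0.01, boost=1.0):
    """Build a non-symmetric Dirichlet prior over topic-word
    distributions: every entry starts at `base`, and the entries for
    words seeded into a topic get an extra `boost`, nudging that
    topic toward the chosen keywords."""
    eta = np.full((num_topics, len(vocab)), base)
    index = {w: j for j, w in enumerate(vocab)}
    for k, words in seed_words.items():
        for w in words:
            eta[k, index[w]] += boost
    return eta

# Seed topic 0 toward sports words, topic 1 toward finance words.
eta = seeded_eta(vocab, num_topics,
                 {0: ["game", "team"], 1: ["market", "stock"]})
```

The resulting matrix would be passed to the inference routine in place of a single scalar concentration parameter; gensim's online LDA accepts an asymmetric prior in essentially this shape.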
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350 ]

yuhao yang commented on SPARK-5563:
-----------------------------------

Matthew Willson, thanks for the attention and the ideas. Apart from gensim, vowpal-wabbit also has a distributed implementation provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible, and suggestions are always welcome.
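For reference, the heart of Hoffman et al.'s online algorithm being discussed here is a stochastic step on the topic-word variational parameter lambda: each minibatch yields an estimate lambda-hat, which is blended into the running value with a decaying step size. A minimal sketch (parameter names follow the paper; the helpers themselves are illustrative, not any library's API):

```python
import numpy as np

def learning_rate(t, tau0=1.0, kappa=0.7):
    # Step size rho_t = (tau0 + t)^(-kappa); choosing kappa in
    # (0.5, 1] satisfies the Robbins-Monro convergence conditions.
    return (tau0 + t) ** (-kappa)

def online_update(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    # Blend the old topic-word parameter lambda with lambda-hat,
    # the estimate computed from the current minibatch alone:
    # lambda <- (1 - rho_t) * lambda + rho_t * lambda-hat
    rho = learning_rate(t, tau0, kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

Because each step touches only one minibatch, the per-iteration cost is independent of the corpus size, which is what makes the online variant attractive for large datasets and for a streaming LDA.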
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363308#comment-14363308 ]

Matthew Willson commented on SPARK-5563:
----------------------------------------

Definitely keen on this! In case it's useful for reference, gensim (https://github.com/piskvorky/gensim) has a great Python/numpy implementation of this, alongside various other goodies including an online version of HDP (Hierarchical Dirichlet Process) LDA, which doesn't require the number of topics to be fixed in advance. (I've been told that if you're going to implement online variational inference for LDA, there isn't much extra cost in going the whole way and implementing it for HDP-LDA...)
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363578#comment-14363578 ]

Joseph K. Bradley commented on SPARK-5563:
------------------------------------------

[~matthjw] That's a good point: For most inference algorithms (sampling, batch EM), it's significantly more expensive/difficult to go from LDA to the HDP, but it may be a lot easier for online variational inference.
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308584#comment-14308584 ]

Apache Spark commented on SPARK-5563:
-------------------------------------

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/4419
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115 ]

yuhao yang commented on SPARK-5563:
-----------------------------------

Thanks, Joseph, for helping create the JIRA. Pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952] here, and sharing the current implementation at https://github.com/hhbyyh/OnlineLDA_Spark. I agree with the suggestions listed above and will propose a PR for more detailed discussion soon. Thanks.
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305199#comment-14305199 ]

yuhao yang commented on SPARK-5563:
-----------------------------------

BTW, a batch version of online variational inference is useful when processing small data sets (especially toy data in unit tests).
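One way to see why a batch mode falls out almost for free: in Hoffman et al.'s update, setting the decay exponent kappa to 0 makes the step size rho_t equal 1 on every iteration, so each pass simply replaces lambda with the estimate from the current "minibatch". If that minibatch is the whole (small) corpus, every pass is a standard batch variational Bayes update. A hedged sketch of just that relationship (the function is illustrative, not MLlib's API):

```python
import numpy as np

def online_update(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    # Hoffman et al.'s step size: rho_t = (tau0 + t)^(-kappa).
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat

# With kappa = 0, rho_t = 1 for every t: the old lambda is discarded
# and replaced by the estimate computed from the current batch. When
# that batch is the full corpus, this is batch variational Bayes.
batch_step = online_update(np.array([1.0, 2.0]),
                           np.array([5.0, 6.0]), t=3, kappa=0.0)
```

So supporting a batch mode need not mean a separate code path; it can be a degenerate configuration of the online optimizer.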