[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-17 Thread Matthew Willson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365073#comment-14365073
 ] 

Matthew Willson commented on SPARK-5563:


No, thank you for working on this! Great stuff.

Vowpal Wabbit is indeed scarily fast :)

If you're looking for more suggestions, one other quick win for LDA is the 
ability to seed topics with specific keywords by specifying non-symmetric 
Dirichlet priors for some of the topic-word distributions. This was very easy 
to add to gensim's online LDA implementation, for example.
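
A minimal sketch of the seeding idea, in plain NumPy rather than any Spark or 
gensim API. The helper name `seeded_topic_word_prior`, the base value 0.01 and 
the boost value 1.0 are illustrative assumptions, not taken from either 
library; the point is only that the prior becomes a (topics x vocabulary) 
matrix instead of a single scalar.

```python
import numpy as np

def seeded_topic_word_prior(vocab, num_topics, seeds, base=0.01, boost=1.0):
    """Build a (num_topics, len(vocab)) Dirichlet prior over topic-word distributions.

    `seeds` maps a topic index to the keywords that topic should favour.
    All names and default values here are hypothetical, for illustration only.
    """
    word_id = {w: i for i, w in enumerate(vocab)}
    # Start from a symmetric prior: every word gets the same small pseudo-count.
    eta = np.full((num_topics, len(vocab)), base, dtype=float)
    # Make the prior non-symmetric only for the seeded topics and keywords.
    for topic, words in seeds.items():
        for w in words:
            eta[topic, word_id[w]] = boost
    return eta

vocab = ["spark", "cluster", "goal", "match", "topic"]
eta = seeded_topic_word_prior(
    vocab, num_topics=3, seeds={0: ["spark", "cluster"], 1: ["goal", "match"]}
)
```

Topic 2 has no seeds here, so its row stays symmetric and the topic is free to 
pick up whatever the data supports.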

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking into the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.
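
For readers unfamiliar with the cited paper, a rough sketch of its core 
update: each mini-batch yields an estimate `lam_hat` of the topic-word 
parameter lambda, which is blended into the running value with a decaying 
step size rho_t = (tau0 + t)^(-kappa). The function names and the default 
values tau0=64 and kappa=0.7 are assumptions made for illustration; this is 
not the Spark implementation, and the E-step that produces `lam_hat` is 
omitted.

```python
import numpy as np

def learning_rate(t, tau0=64.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** -kappa (defaults are illustrative)."""
    return (tau0 + t) ** (-kappa)

def online_update(lam, lam_hat, t, tau0=64.0, kappa=0.7):
    """Blend the running topic-word parameter with the mini-batch estimate."""
    rho = learning_rate(t, tau0, kappa)
    return (1.0 - rho) * lam + rho * lam_hat

# Toy example: 2 topics, 4 words, and the same mini-batch estimate at every step.
lam = np.ones((2, 4))
lam_hat = np.array([[5.0, 1.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0, 5.0]])
for t in range(3):
    lam = online_update(lam, lam_hat, t)
```

Because the step size shrinks as t grows, later mini-batches move lambda less 
than earlier ones.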



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350
 ] 

yuhao yang commented on SPARK-5563:
---

Matthew Willson, thanks for the attention and the idea. Apart from gensim, 
Vowpal Wabbit also has a distributed implementation provided by Matthew D. 
Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible, and suggestions are always welcome.




[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread Matthew Willson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363308#comment-14363308
 ] 

Matthew Willson commented on SPARK-5563:


Definitely keen on this!

In case it's useful for reference, gensim (https://github.com/piskvorky/gensim) 
has a great Python/numpy implementation of this, alongside various other 
goodies including an online version of HDP (Hierarchical Dirichlet Process) 
LDA, which doesn't require the number of topics to be fixed in advance. (I've 
been told that if you're going to implement online variational inference for 
LDA, there isn't much extra cost in going the whole way and implementing it for 
HDP-LDA...)




[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363578#comment-14363578
 ] 

Joseph K. Bradley commented on SPARK-5563:
--

[~matthjw]  That's a good point: For most inference algorithms (sampling and 
batch EM), it's significantly more expensive/difficult to go from LDA to the 
HDP, but it may be a lot easier for online variational inference.




[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308584#comment-14308584
 ] 

Apache Spark commented on SPARK-5563:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/4419




[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115
 ] 

yuhao yang commented on SPARK-5563:
---

Thanks, Joseph, for helping create the JIRA.
I'm pasting the previous [comment link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
here and sharing the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestions listed above and will propose a PR for more 
detailed discussion soon. Thanks.





[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305199#comment-14305199
 ] 

yuhao yang commented on SPARK-5563:
---

BTW, a batch version of online variational inference is useful when processing 
small data sets (especially toy data in unit tests).
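
One possible reading of the batch case, following the Hoffman et al. paper: 
if the mini-batch is the whole corpus and kappa = 0, the step size 
rho_t = (tau0 + t)^(-kappa) is always 1, so each update simply replaces lambda 
with the new estimate, which is batch variational Bayes. A small sketch of 
that reduction; the function name `blend` and the toy values are hypothetical.

```python
import numpy as np

def blend(lam, lam_hat, t, tau0, kappa):
    """One online update: lambda <- (1 - rho_t) * lambda + rho_t * lam_hat."""
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat

lam = np.full((2, 3), 0.5)              # current topic-word parameter (toy values)
lam_hat = np.array([[3.0, 1.0, 1.0],
                    [1.0, 1.0, 3.0]])   # estimate computed from the full toy corpus

# kappa = 0 gives rho = 1: the old lambda is discarded, as in a batch update.
batch_like = blend(lam, lam_hat, t=5, tau0=1.0, kappa=0.0)
# kappa = 0.7 gives rho < 1: the old lambda still contributes.
online_like = blend(lam, lam_hat, t=5, tau0=1.0, kappa=0.7)
```

On toy data this makes results deterministic given the initialization, which 
is convenient for unit tests.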
