Jake,
Thanks for the pending update.
Slightly off topic: if I understand your notes on MAHOUT-897 correctly, Gibbs
sampling would only be feasible in MR implementations that support efficient
iteration (Spark, perhaps YARN), but not in Mahout as currently conceived. In the
case of Spark, the RDD is the shared memory that enables faster synchronization
across samplers. The need for synchronization across local samplers may mean
that Gibbs sampling is better suited to OpenMP.
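To make the synchronization requirement concrete, here is a minimal sketch, assuming each local sampler sweeps its own partition of documents with a private copy of the topic-word counts and the copies are reconciled once per sweep (a pattern sometimes called approximate distributed LDA). The function name and the list-of-lists layout are illustrative assumptions, not anything taken from MAHOUT-897 or Spark.

```python
def merge_topic_word_counts(global_counts, local_counts):
    """Reconcile per-sampler copies of the topic-word counts after one sweep.

    global_counts: K x V list of lists that every sampler started the sweep with.
    local_counts:  list of K x V lists of lists, one per sampler, after the sweep.
    (Both layouts are assumptions made for this sketch.)
    """
    merged = [row[:] for row in global_counts]
    for counts in local_counts:
        for k, row in enumerate(counts):
            for w, value in enumerate(row):
                # Each sampler contributes only the change it made to its copy.
                merged[k][w] += value - global_counts[k][w]
    return merged
```

The merge acts as a barrier: every sampler has to finish its sweep and ship its counts before the next sweep can start, so the cost of one synchronization is paid on every iteration.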
The approach in MAHOUT-897 is understandably similar to
http://arxiv.org/pdf/1107.3765 (Using Variational Inference and MapReduce to
Scale Topic Modeling).
Do you have any recommendations for topic updates that might work well (close to
real time) in practice?
For example, Yao's http://www.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf
suggests simple heuristics for identifying novel topics and a memory-efficient
streaming update based on SparseLDA. I would expect something based on SparseLDA
to be efficient for online updates.
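As a rough illustration of why SparseLDA is cheap per token: the unnormalized collapsed Gibbs mass for a topic, (alpha_t + n_t|d)(beta + n_w|t) / (beta*V + n_t), splits into a smoothing term, a document term, and a word term, and the last two are nonzero only for topics that actually occur in the document or for the word. The sketch below assumes a symmetric beta and counts that already exclude the current token; the function and argument names are hypothetical.

```python
def sparse_lda_buckets(alpha, beta, vocab_size, doc_topic, topic_word, topic_total):
    """Return the (smoothing, document, word) masses for one token.

    alpha:       list of K priors on document-topic proportions.
    beta:        scalar prior on topic-word distributions (assumed symmetric).
    doc_topic:   n_t|d, tokens in this document assigned to each topic.
    topic_word:  n_w|t, count of this word type under each topic.
    topic_total: n_t, total tokens assigned to each topic.
    All counts are assumed to exclude the token being resampled.
    """
    s = r = q = 0.0
    for t in range(len(alpha)):
        denom = beta * vocab_size + topic_total[t]
        s += alpha[t] * beta / denom          # dense, but changes slowly
        if doc_topic[t]:                      # only topics present in the document
            r += doc_topic[t] * beta / denom
        if topic_word[t]:                     # only topics that use this word
            q += (alpha[t] + doc_topic[t]) * topic_word[t] / denom
    return s, r, q


def full_conditional_mass(alpha, beta, vocab_size, doc_topic, topic_word, topic_total):
    """Total unnormalized mass from the standard formula, for comparison."""
    return sum(
        (alpha[t] + doc_topic[t]) * (beta + topic_word[t])
        / (beta * vocab_size + topic_total[t])
        for t in range(len(alpha))
    )
```

The three buckets sum to the same total as the standard formula, so sampling from them is exact. The loop here still visits every topic for clarity; a real implementation would cache the smoothing mass and iterate only over the nonzero counts.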
Charles


On Nov 30, 2011, at 4:14 PM, Jake Mannix wrote:

> On Wed, Nov 30, 2011 at 1:03 PM, Isabel Drost <[email protected]> wrote:
> 
>> On 28.11.2011 bish maten wrote:
>>> mahout ldatopics -i mahout-work/abc/abc-lda/state-20  -d
>>> mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0  -dt
>>> sequencefile  (there were no errors reported and command worked fine with
>>> following output). Does the output appear ok?
>> 
>> Hmm - this only prints the resulting LDA topics - which command did you use
>> to generate them?
>>
>> Please also note that Jake is currently working on improving our LDA support,
>> if you are interested in that algorithm it might be interesting for you to
>> look into his patch in https://issues.apache.org/jira/browse/MAHOUT-897
> 
> 
> Yeah, I'm also working on moving away from LDATopic altogether, instead using
> VectorDumper + dictionary file and grabbing top N weighted elements in the
> vector representing the topic.  We already do this internally at Twitter, I
> just have to get that particular patch formatted properly and cleaned up once
> MAHOUT-897 gets committed (which will hopefully be this week).
> 
>  -jake
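
The "top N weighted elements" step Jake describes can be sketched as follows. This is not Mahout's VectorDumper: plain Python dicts stand in for the sparse topic vector and the dictionary file, and the function name is hypothetical.

```python
import heapq


def top_terms(topic_weights, dictionary, n=10):
    """Return the n highest-weighted (term, weight) pairs of one topic vector.

    topic_weights: dict mapping term index -> weight (stand-in for a sparse vector).
    dictionary:    dict mapping term index -> term string (stand-in for the
                   dictionary file).
    """
    # nlargest avoids sorting the whole vocabulary when n is small.
    best = heapq.nlargest(n, topic_weights.items(), key=lambda item: item[1])
    return [(dictionary[index], weight) for index, weight in best]
```

For example, `top_terms({0: 0.1, 1: 0.5, 2: 0.3}, {0: "a", 1: "b", 2: "c"}, n=2)` returns `[("b", 0.5), ("c", 0.3)]`.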
