So I tried Yahoo LDA on 52M documents with 1000 topics. With a dictionary of 100k terms, Yahoo LDA does one iteration every 30 minutes on a single machine using 4 cores.
Mahout LDA using 20 nodes and a dictionary of 30k terms takes more than a
day per iteration and didn't complete (something about an output error
during the reduce step - this may be a CDH beta3 issue, not sure, since
Reuters clusters fine). Hopefully the ideas from the Yahoo version can be
incorporated into Mahout LDA.

On Fri, Jun 10, 2011 at 6:49 AM, Federico Castanedo <[email protected]> wrote:

> Hi all,
>
> I read through the referenced paper, and it seems that besides all the
> distributed machinery, the key element in the improved LDA training
> performance is how the inference for \alpha and \beta is performed.
> They use SGD for the hyperparameter adjustment of \alpha.
>
> Best,
> Federico
>
> 2011/6/10 Jake Mannix <[email protected]>
>
> > It's all C++: custom distributed processing, custom distributed
> > coordination, and custom storage.
> >
> > We can certainly try to port over the algorithmic ideas, but the
> > distributed-systems side would be a significant departure from our
> > current setup - it's not a web service, it's not Hadoop, and it's not
> > a command-line utility - it's a cluster of long-running processes, all
> > intercommunicating. Sounds awesome, but that's a ways off from where
> > we are now.
> >
> > -jake
> >
> > On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <[email protected]> wrote:
> >
> > > Awesome! I guess it would be much faster than the current version in
> > > Mahout. Would it be possible to just use this version in Mahout?
> > >
> > > On Fri, Jun 10, 2011 at 8:12 AM, <[email protected]> wrote:
> > >
> > > > Yahoo released its Hadoop code for LDA:
> > > >
> > > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
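For concreteness, here is a minimal sketch of the kind of SGD hyperparameter
adjustment Federico describes: a stochastic gradient step on a symmetric
Dirichlet \alpha, using the digamma-based gradient of the per-document log
evidence log p(z_d | \alpha). This is an illustration under stated
assumptions, not the Yahoo implementation; the function names, learning
rate, clipping floor, and the synthetic Poisson topic counts are all made
up for the example.

import numpy as np
from scipy.special import digamma

def grad_log_evidence(alpha, topic_counts):
    # Gradient of log p(z_d | alpha) for one document under a symmetric
    # Dirichlet(alpha) prior, where topic_counts[k] = n_dk, the number of
    # tokens in document d assigned to topic k:
    #   log p = log G(K a) - log G(N + K a) + sum_k [log G(n_dk + a) - log G(a)]
    # so d/da = K psi(K a) - K psi(N + K a) + sum_k [psi(n_dk + a) - psi(a)].
    K = len(topic_counts)
    N = topic_counts.sum()
    return (K * digamma(K * alpha)
            - K * digamma(N + K * alpha)
            + np.sum(digamma(topic_counts + alpha) - digamma(alpha)))

def sgd_alpha_step(alpha, topic_counts, lr=1e-4, floor=1e-3):
    # One stochastic gradient step on alpha from a single document's
    # topic counts, clipped to keep alpha positive. (lr and floor are
    # illustrative values, not from the Yahoo code.)
    return max(floor, alpha + lr * grad_log_evidence(alpha, topic_counts))

# Usage sketch: after a sampling sweep, stream documents and nudge alpha.
# Synthetic counts stand in for the sampler's per-document n_dk vectors.
alpha = 0.1
rng = np.random.default_rng(0)
for _ in range(1000):
    counts = rng.poisson(2.0, size=1000).astype(float)  # 1000 topics
    alpha = sgd_alpha_step(alpha, counts)

The appeal of this style of update in a distributed sampler is that each
worker can adjust \alpha from the documents it already holds, rather than
running a global fixed-point iteration over all counts every pass.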
