Timmy, Mahout's current LDA implementation would probably have problems with the vocabulary size of this data set (it requires the full model [of size: numTopics * vocabSize] to live in memory in all mappers), but I've got another variation of this codebase which scales a lot better on my Mahout fork on GitHub <https://github.com/jakemannix/Mahout>; the branch name is "cvb0". It hasn't been integrated with Mahout trunk yet, though, on account of needing quite a bit more documentation and cleanup.
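As a rough back-of-envelope sketch of why that in-memory model is a problem here (assuming 8-byte doubles and, illustratively, a topic count of 200; in the users-as-documents, relationships-as-terms setup the "vocabulary" is itself the ~25M users, so that figure comes from the thread, not a measurement):

```python
# Back-of-envelope memory estimate for an in-memory LDA model held by
# every mapper: a dense numTopics x vocabSize matrix of doubles.
# The topic count of 200 below is an illustrative assumption.

BYTES_PER_DOUBLE = 8

def model_memory_gib(num_topics, vocab_size):
    """Size of a dense topic-term matrix of doubles, in GiB."""
    return num_topics * vocab_size * BYTES_PER_DOUBLE / 2**30

# 200 topics over a ~25M-term "vocabulary" (users as terms):
print(model_memory_gib(200, 25_000_000))  # ~37 GiB per mapper
```

Tens of gigabytes per mapper is well past what a typical Hadoop task of that era could hold, which is the motivation for the cvb0 variant not keeping the whole model resident.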
Hopefully I can get some time to clean that code up and get it into Mahout trunk: I've seen it pull off a 10-16x speedup over the current impl (and it isn't memory-limited in any sense at all, although it *is* a bit heavy on disk usage; cf. http://twitter.com/#!/lintool/status/104271708420190208 ).

I'll second Ted's suggestion of Yahoo's LDA; everyone I know who's tried it has been super-impressed with its performance.

  -jake

On Sun, Sep 18, 2011 at 3:53 AM, Timmy Wilson <[email protected]> wrote:
> Hi,
>
> I'm considering using LDA to cluster a large social graph.
>
> Users are documents, relationships are terms --
> http://www.machinedlearnings.com/2011/03/lda-on-social-graph.html
>
> I want to scale to 25M documents @ 200 terms/doc (on average).
>
> Is this reasonable, are there examples of LDA usage @ this scale?
>
> Thanks,
> Timmy Wilson
