> I'll second Ted's suggestion of Yahoo's LDA,
> everyone I know who's tried it has been super-impressed
> with its performance.
Cool -- thanks guys!!

On Sun, Sep 18, 2011 at 6:06 PM, Jake Mannix <[email protected]> wrote:

> Timmy,
>
> Mahout's current LDA implementation would probably have problems with the
> vocabulary size of this data set (it requires the full model [of size:
> numTopics * vocabSize] to live in memory in all mappers), but I've got
> another variation of this codebase which scales a lot better on my GitHub
> Mahout fork <https://github.com/jakemannix/Mahout>, branch name is "cvb0".
> But it hasn't been integrated with Mahout trunk on account of needing
> quite a bit more documentation and cleanup.
>
> Hopefully I can get some time to clean that code up and get it into Mahout
> trunk, as I have seen it pull off a 10-16x speedup over the current impl
> (and isn't memory limited in any sense at all, although it *is* a bit
> heavy on disk usage: c.f.
> http://twitter.com/#!/lintool/status/104271708420190208 ).
>
> I'll second Ted's suggestion of Yahoo's LDA, everyone I know who's tried
> it has been super-impressed with its performance.
>
> -jake
>
> On Sun, Sep 18, 2011 at 3:53 AM, Timmy Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> I'm considering using LDA to cluster a large social graph.
>>
>> Users are documents, relationships are terms --
>> http://www.machinedlearnings.com/2011/03/lda-on-social-graph.html
>>
>> I want to scale to 25M documents @ 200 terms/doc (on average).
>>
>> Is this reasonable, are there examples of LDA usage @ this scale?
>>
>> Thanks,
>> Timmy Wilson
>>
>
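For a rough sense of the memory concern Jake raises (the full numTopics * vocabSize model living in every mapper), here is a back-of-the-envelope sketch. It is only an estimate under stated assumptions: the helper name `dense_model_bytes` is made up for illustration, entries are assumed to be 8-byte doubles, the vocabulary is assumed to be as large as the user count (25M, since relationships are the terms), and the topic counts are arbitrary examples. None of these numbers come from Mahout itself.

```python
def dense_model_bytes(num_topics: int, vocab_size: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to hold a dense numTopics x vocabSize matrix.

    bytes_per_entry defaults to 8 (one double per entry), which is an
    assumption, not a statement about Mahout's actual storage format.
    """
    return num_topics * vocab_size * bytes_per_entry


# Assumption: in a social graph where relationships are terms, the
# vocabulary can be about as large as the number of users (25M here).
vocab_size = 25_000_000

# Arbitrary example topic counts; the thread does not specify one.
for num_topics in (20, 100, 500):
    gib = dense_model_bytes(num_topics, vocab_size) / 2**30
    print(f"{num_topics:>4} topics: {gib:,.1f} GiB per mapper")
```

Even at modest topic counts the dense model runs to gigabytes per mapper under these assumptions, which is consistent with the warning about vocabulary size above.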
