Timmy,

  Mahout's current LDA implementation would probably have problems with the
vocabulary size of this data set (it requires the full model [of size:
numTopics * vocabSize] to live in memory in all mappers), but I've got
another variation of this codebase which scales a lot better on my GitHub
Mahout fork <https://github.com/jakemannix/Mahout>, branch name "cvb0".
It hasn't been integrated with Mahout trunk yet, on account of needing
quite a bit more documentation and cleanup.
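To see why the numTopics * vocabSize model is the bottleneck, here's a quick back-of-envelope sketch; the topic count and vocabulary size below are illustrative assumptions, not numbers from this thread:

```python
# Rough estimate of the per-mapper model memory the current Mahout LDA
# needs: the full numTopics x vocabSize matrix of doubles in every mapper.
# num_topics and vocab_size are hypothetical example values.
num_topics = 200
vocab_size = 5_000_000   # large "vocabulary" (e.g. users-as-terms in a social graph)
bytes_per_entry = 8      # one double per (topic, term) cell

model_bytes = num_topics * vocab_size * bytes_per_entry
print(f"model size: {model_bytes / 2**30:.1f} GiB per mapper")
```

With numbers like these the model alone is several GiB per mapper, which is why the in-memory approach stops scaling as the vocabulary grows.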

  Hopefully I can get some time to clean that code up and get it into Mahout
trunk, as I have seen it pull off a 10-16x speedup over the current impl
(and it isn't memory-limited in any sense at all, although it *is* a bit
heavy on disk usage: c.f.
http://twitter.com/#!/lintool/status/104271708420190208 ).

  I'll second Ted's suggestion of Yahoo's LDA, everyone I know who's tried
it has been super-impressed with its performance.

  -jake

On Sun, Sep 18, 2011 at 3:53 AM, Timmy Wilson <[email protected]> wrote:

> Hi,
>
> I'm considering using LDA to cluster a large social graph.
>
> Users are documents, relationships are terms --
> http://www.machinedlearnings.com/2011/03/lda-on-social-graph.html
>
> I want to scale to 25M documents @ 200terms/doc (on average).
>
> Is this reasonable, are there examples of LDA usage @ this scale?
>
> Thanks,
> Timmy Wilson
>
