We already have the second part, the hashing trick, thanks to Ted, and he
also has a mechanism to partially reverse-engineer the hashed features. You
might be able to drop it directly into the job itself, or even vectorize
first and then run LDA.
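
For anyone following along, here is a rough, self-contained sketch of the
idea (the class and hash scheme below are illustrative, not Ted's actual
encoder code): hash each term into a fixed-size index space, and keep a
trace map from index back to terms, which is what makes the partial
reverse-engineering possible:

  import java.util.*;

  // Illustrative feature hashing with a trace map for partial reversal.
  // This is a hypothetical sketch, not the actual Mahout encoder classes.
  public class HashingTrickSketch {
      static final int NUM_FEATURES = 1 << 14;  // fixed size, independent of vocabulary
      static int indexFor(String term) {
          // stable index in [0, NUM_FEATURES), safe for negative hashCodes
          return Math.floorMod(term.hashCode(), NUM_FEATURES);
      }
      public static void main(String[] args) {
          double[] vector = new double[NUM_FEATURES];
          Map<Integer, Set<String>> trace = new HashMap<>();
          for (String term : "apache mahout lda apache".split(" ")) {
              int idx = indexFor(term);
              vector[idx] += 1.0;  // hashing trick: no dictionary pass needed
              trace.computeIfAbsent(idx, k -> new HashSet<>()).add(term);
          }
          // Partial reversal: an index maps back to the (possibly colliding)
          // set of terms that hashed into it.
          int idx = indexFor("mahout");
          System.out.println(idx + " -> " + trace.get(idx));
      }
  }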

Robin


On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <[email protected]> wrote:

> Hey Robin,
>
>  Vowpal Wabbit is scalable in numDocs by being a streaming system, in
> numFeatures by using hashing, and in time by being blazingly fast.
>
>   I'm unfortunately just a novice LDA coder, so my attempts at
> deciphering VW's LDA impl (to see if there is anything we can learn from
> it that we aren't already doing) have been... slow.
>
>  One thing we could do is write a streaming form of our current MR LDA, and
> see at what scale it actually starts to help.
>
>  -jake
>
> On Jan 3, 2011 9:33 PM, "Robin Anil" <[email protected]> wrote:
>
> Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA
> implementation there. It might be scalable.
>
> On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <[email protected]> wrote:
> >
> > Hey all,
> >
> > tl;dr ...
> >
> > MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among
> > other things, which seems to have been closed even though it was never
> > committed, nor was its function...
> >
> > Wikipedia <http://markmail.org/message/ua5hckybpkj3stdl>),
> >
> > this puts an absolute cap on the size of the possible vocabulary
> > (numTerms * numTopics * 8byte...
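
To put a rough number on the cap Jake describes: the in-memory model is
numTerms * numTopics doubles, 8 bytes each. A back-of-envelope with
illustrative figures (these particular numbers are assumptions, not from
the thread):

  // Back-of-envelope for the dense LDA model footprint described above;
  // the corpus figures are made-up examples, not numbers from this thread.
  public class LdaModelFootprint {
      public static void main(String[] args) {
          long numTerms  = 1_000_000L;             // assumed vocabulary size
          long numTopics = 200L;                   // assumed topic count
          long bytes = numTerms * numTopics * 8L;  // one 8-byte double per (term, topic) cell
          System.out.printf("model needs ~%.1f GB of RAM%n", bytes / 1e9);  // ~1.6 GB
      }
  }

which is why either capping the vocabulary or hashing terms into a
fixed-size space becomes necessary as the corpus grows.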
>
