On Thu, Jan 6, 2011 at 9:22 AM, Ted Dunning <[email protected]> wrote:

> The topics in LDA are not the same as topics in normal parlance.  They are
> abstract, internal probability distributions.
>
>
Yes. It also depends on the LDA-extracted topics existing in the document's
text or metadata.


> That said, your suggestion is a reasonable one.  If you use the LDA topic
> distribution for each document as a feature vector for a supervised model
> then it is pretty easy to argue that LDA distributions that give better
> model performance are better at capturing content.  The supervised step is
> necessary, however, since there is no guarantee that the LDA topics will
> have a simple relationship to human-assigned categories.
>
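
A rough sketch of that supervised step (Python/scikit-learn here rather
than anything Mahout-specific, purely for illustration; the data below is
a random stand-in just to make it run):

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # Stand-in inputs: in practice doc_topic_dists comes from LDA and
  # human_labels from tagging data; random values here only give shapes.
  rng = np.random.default_rng(0)
  doc_topic_dists = rng.dirichlet(np.ones(20), size=1000)  # (docs, topics)
  human_labels = rng.integers(0, 5, size=1000)             # 5 categories

  # Train on the topic distributions; held-out accuracy is the score for
  # "this LDA run captures content better than that one".
  X_tr, X_te, y_tr, y_te = train_test_split(
      doc_topic_dists, human_labels, test_size=0.2, random_state=0)
  clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
  print("held-out accuracy:", clf.score(X_te, y_te))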

If one thinks of LDA as outputting a distribution of topics for a given
document, then at some point a real decision is made to output N topic
labels... at that point it looks like a classifier.
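
That "real decision" could be as simple as a top-N cut over the
distribution. A hypothetical sketch (the function name and N are made up):

  import numpy as np

  def top_n_topics(topic_dist, n=3):
      # Turn an LDA topic distribution into N discrete topic labels by
      # keeping the N highest-probability topics.
      return [int(i) for i in np.argsort(topic_dist)[::-1][:n]]

  # One document's distribution over 5 topics:
  print(top_n_topics(np.array([0.05, 0.40, 0.10, 0.30, 0.15]), n=2))
  # -> [1, 3]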

I'm suggesting that one can do a classification-accuracy test of the
LDA-predicted label set against a set of human-generated labels from
tagging data:

1) Document.DataVector
2) Document.LabelsVector
3) Run LDA on Document.DataVector to generate Document.ExtractedTopicsVector

4) Compute accuracy by comparing Document.LabelsVector to
   Document.ExtractedTopicsVector (sketched below)
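
For step 4, per-document set overlap gives precision/recall. This assumes
some mapping from LDA topic ids into the tag vocabulary already exists
(which, per Ted's point, is exactly what isn't guaranteed); the names are
made up for illustration:

  def label_set_scores(labels_vector, extracted_topics_vector):
      # Precision/recall between the human label set and the
      # LDA-extracted topic label set for a single document.
      human = set(labels_vector)
      predicted = set(extracted_topics_vector)
      hits = human & predicted
      precision = len(hits) / len(predicted) if predicted else 0.0
      recall = len(hits) / len(human) if human else 0.0
      return precision, recall

  print(label_set_scores(["hadoop", "nosql"], ["hadoop", "java", "search"]))
  # -> (0.333..., 0.5)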

There will be misses if the human-labeled/tagged term or phrase does not
exist within the document's text or metadata. LDA can't see these unless
some augmentation/inference step is run on the document vector prior to
LDA input.

So it's not really supervised, as there is no training... it's just the
second-stage testing part of supervised learning.

- Neal



>
> On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter <[email protected]> wrote:
>
> > What about gauging its ability to predict the topics of labeled data?
> >
> > 1) Grab RSS feeds of blog posts and use the tags as labels
> > 2) Delicious bookmarks & their content versus user tags
> > 3) other examples abound...
> >
> > On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > Saying we have hashing is different than saying we know what will
> > > happen to an algorithm once it's running over hashed features (as the
> > > continuing work on our Stochastic SVD demonstrates).
> > >
> > > I can certainly try to run LDA over a hashed vector set, but I'm not
> > > sure what criteria for correctness / quality of the topic model I
> > > should use if I do.
> > >
> > >  -jake
> > >
> > > On Jan 4, 2011 7:21 AM, "Robin Anil" <[email protected]> wrote:
> > >
> > > We already have the second part - the hashing trick. Thanks to Ted,
> > > and he has a mechanism to partially reverse engineer the feature as
> > > well. You might be able to drop it directly in the job itself or even
> > > vectorize and then run LDA.
> > >
> > > Robin
> > >
> > > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <[email protected]> wrote:
> > > >
> > > > Hey Robin, Vowp...
> > >
> >
>
