One of the strong arguments FOR latent Dirichlet allocation (aka LDA) is that it explicitly maintains a model of word counts. Normalizing the document vectors throws that critical information away and is a really bad idea.
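Concretely, if you vectorize with Mahout's seq2sparse, keeping the raw counts just means asking for plain tf weighting and not asking for any norm. Rough sketch only: the paths are placeholders and the flag spellings are from memory, so verify them against bin/mahout seq2sparse --help on your version.

  # convert a directory of plain-text files into SequenceFiles (placeholder paths)
  $MAHOUT_HOME/bin/mahout seqdirectory -i /path/to/raw-text -o /path/to/seqfiles

  # -wt tf keeps raw term counts; leaving out -n/--norm means no length
  # normalization is applied, so a 2000-word page carries ~100x the mass
  # of a 20-word page
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i /path/to/seqfiles \
    -o /path/to/vectors \
    -wt tf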
On Thu, Jan 26, 2012 at 7:46 PM, Jake Mannix <[email protected]> wrote:

> On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <[email protected]> wrote:
>
> > One more question. What about vector normalization when you vectorize
> > your documents. Would this help with topic model quality?
>
> No, unless you have reason to feel that document length is definitely *not*
> an indicator of how much topical information is being provided. So if you're
> building topic models off of webpages, and a page has only 20 words on it,
> do you *want* it to have the same impact on the overall topic model as a
> big page with 2000 words on it? Maybe you do, if you've got a good reason,
> but I can't think of a domain-independent reason to do that.
>
> > On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:
> >
> > > Thanks for all the feedback! You've been a big help.
> > >
> > > On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
> > >
> > > > On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
> > > >
> > > > > Hey Jake,
> > > > > Thanks for the tips. That will definitely help.
> > > > >
> > > > > One more question, do you know if the topic model quality will be
> > > > > affected by the document length?
> > > >
> > > > Yes, very much so.
> > > >
> > > > > I'm thinking lengths ranging from tweets (~20 words),
> > > >
> > > > Tweets suck. Trust me on this. ;)
> > > >
> > > > > to emails (hundreds of words),
> > > >
> > > > Fantastic size.
> > > >
> > > > > to whitepapers (thousands of words)
> > > >
> > > > Can be pretty great too.
> > > >
> > > > > to books (boat loads of words).
> > > >
> > > > This is too long. There will be tons and tons of topics in a book, often.
> > > > But, frankly, I have not tried with huge documents personally, so I can't
> > > > say from experience that it won't work. I'd just not be terribly surprised
> > > > if it didn't work well at all. If I had a bunch of books I wanted to run
> > > > LDA on, I'd maybe treat each page or each chapter as a separate document.
> > > >
> > > > -jake
> > > >
> > > > > What lengths'ish would degrade topic model quality.
> > > > >
> > > > > I would think tweets would kind'a suck, but what about longer docs?
> > > > > Should they be segmented into sub-documents?
> > > > >
> > > > > Thanks,
> > > > > JohnC
> > > > >
> > > > > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]> wrote:
> > > > >
> > > > > > Hi John,
> > > > > >
> > > > > > I'm not an expert in the field, but I have done a bit of work building
> > > > > > topic models with LDA, and here are some of the "tricks" I've used:
> > > > > >
> > > > > > 1) yes remove stop words, in fact remove all words occurring in more
> > > > > > than (say) half (or more conservatively, 90%) of your documents, as
> > > > > > they'll be noise and just dominate your topics.
> > > > > >
> > > > > > 2) more features is better, if you have the memory for it (note that
> > > > > > mahout's LDA currently holds numTopics * numFeatures in memory in the
> > > > > > mapper tasks, which means that you are usually bounded to a few hundred
> > > > > > thousand features, maybe up as high as a million, currently). So don't
> > > > > > stem, and throw in commonly occurring (or more importantly: high
> > > > > > log-likelihood) bigrams and trigrams as independent features.
> > > > > >
> > > > > > 3) violate the underlying assumption of LDA, that you're talking about
> > > > > > "token occurrences", and weight your vectors not as "tf", but "tf*idf",
> > > > > > which makes rarer features more prominent, which ends up making your
> > > > > > topics look a lot nicer.
> > > > > >
> > > > > > Those are the main tricks I can think of right now.
> > > > > >
> > > > > > If you're using Mahout trunk, try the new LDA impl:
> > > > > >
> > > > > > $MAHOUT_HOME/bin/mahout cvb0 --help
> > > > > >
> > > > > > It operates on the same kind of input as the last one (ie. a corpus
> > > > > > which is a SequenceFile<IntWritable, VectorWritable>).
> > > > > >
> > > > > > -jake
> > > > > >
> > > > > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:
> > > > > >
> > > > > > > I'm trying to find out if there are any standard best practices for
> > > > > > > document tokenization when prepping your data for LDA in order to get
> > > > > > > a higher quality topic model, and to understand how the feature space
> > > > > > > affects topic model quality.
> > > > > > >
> > > > > > > For example, will the topic model be "better" if there is a more rich
> > > > > > > feature space by not stemming terms, or is it better to have a more
> > > > > > > normalized feature space by applying stemming?
> > > > > > >
> > > > > > > Is it better to filter out stop words, or keep them in?
> > > > > > >
> > > > > > > Is it better to include bi and/or tri grams of highly correlated terms
> > > > > > > in the feature space?
> > > > > > >
> > > > > > > In essence what characteristics of the feature space that LDA uses for
> > > > > > > input will create a higher quality topic model.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JohnC
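For anyone digging this thread up later, here is how Jake's three tricks map onto a single seq2sparse run, plus the extra hop to the input format cvb0 wants. Treat it as a sketch, not gospel: the option names are what I remember from trunk (-x = --maxDFPercent, -ng = --maxNGramSize, -ml = --minLLR, -wt = --weight), the paths are placeholders, and everything should be double-checked against --help on your own checkout.

  WORK=/user/me/lda-demo   # placeholder working directory

  # tricks 1-3: don't stem (the stock analyzer doesn't), prune terms appearing
  # in more than ~50% of documents, emit bigrams/trigrams and keep only the
  # high log-likelihood ones, and weight as tf*idf
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i $WORK/seqfiles \
    -o $WORK/vectors \
    -wt tfidf \
    -x 50 \
    -ng 3 \
    -ml 50

  # cvb0 reads SequenceFile<IntWritable, VectorWritable>, while seq2sparse keys
  # its vectors by Text, so re-key them to integer row ids first
  $MAHOUT_HOME/bin/mahout rowid \
    -i $WORK/vectors/tfidf-vectors \
    -o $WORK/matrix

  # then point the new LDA job at $WORK/matrix/matrix; topic count, dictionary,
  # iteration count etc. are all listed by
  $MAHOUT_HOME/bin/mahout cvb0 --help

Keep in mind that trick 3 (tf*idf) is exactly the kind of re-weighting the reply at the top of this message argues against: it makes the topics look nicer, but LDA is no longer seeing true counts, so pick whichever trade-off matters more for your corpus.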
