One of the strong arguments FOR latent Dirichlet allocation (aka LDA) is that it explicitly maintains a model of word counts. Normalizing the document vectors throws that critical information away and is a really bad idea.
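Concretely, if you vectorize with Mahout's seq2sparse, keeping the raw counts just means asking for plain tf weighting and not asking for any norm. Rough sketch only: the paths are placeholders and the flag spellings are from memory, so verify them against bin/mahout seq2sparse --help on your version.

  # convert a directory of plain-text files into SequenceFiles (placeholder paths)
  $MAHOUT_HOME/bin/mahout seqdirectory -i /path/to/raw-text -o /path/to/seqfiles

  # -wt tf keeps raw term counts; leaving out -n/--norm means no length
  # normalization is applied, so a 2000-word page carries ~100x the mass
  # of a 20-word page
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i /path/to/seqfiles \
    -o /path/to/vectors \
    -wt tf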
On Thu, Jan 26, 2012 at 7:46 PM, Jake Mannix <[email protected]> wrote:

> On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <[email protected]> wrote:
>
> > One more question. What about vector normalization when you vectorize
> > your documents. Would this help with topic model quality?
>
> No, unless you have reason to feel that document length is definitely *not*
> an indicator of how much topical information is being provided. So if you're
> building topic models off of webpages, and a page has only 20 words on it,
> do you *want* it to have the same impact on the overall topic model as a
> big page with 2000 words on it? Maybe you do, if you've got a good reason,
> but I can't think of a domain-independent reason to do that.
>
> > On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:
> >
> > > Thanks for all the feedback! You've been a big help.
> > >
> > > On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
> > >
> > > > On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
> > > >
> > > > > Hey Jake,
> > > > > Thanks for the tips. That will definitely help.
> > > > >
> > > > > One more question, do you know if the topic model quality will be
> > > > > affected by the document length?
> > > >
> > > > Yes, very much so.
> > > >
> > > > > I'm thinking lengths ranging from tweets (~20 words),
> > > >
> > > > Tweets suck. Trust me on this. ;)
> > > >
> > > > > to emails (hundreds of words),
> > > >
> > > > Fantastic size.
> > > >
> > > > > to whitepapers (thousands of words)
> > > >
> > > > Can be pretty great too.
> > > >
> > > > > to books (boat loads of words).
> > > >
> > > > This is too long. There will be tons and tons of topics in a book, often.
> > > > But, frankly, I have not tried with huge documents personally, so I can't
> > > > say from experience that it won't work. I'd just not be terribly surprised
> > > > if it didn't work well at all. If I had a bunch of books I wanted to run
> > > > LDA on, I'd maybe treat each page or each chapter as a separate document.
> > > >
> > > > -jake
> > > >
> > > > > What lengths'ish would degrade topic model quality.
> > > > >
> > > > > I would think tweets would kind'a suck, but what about longer docs?
> > > > > Should they be segmented into sub-documents?
> > > > >
> > > > > Thanks,
> > > > > JohnC
> > > > >
> > > > > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]> wrote:
> > > > >
> > > > > > Hi John,
> > > > > >
> > > > > > I'm not an expert in the field, but I have done a bit of work building
> > > > > > topic models with LDA, and here are some of the "tricks" I've used:
> > > > > >
> > > > > > 1) yes remove stop words, in fact remove all words occurring in more
> > > > > > than (say) half (or more conservatively, 90%) of your documents, as
> > > > > > they'll be noise and just dominate your topics.
> > > > > >
> > > > > > 2) more features is better, if you have the memory for it (note that
> > > > > > mahout's LDA currently holds numTopics * numFeatures in memory in the
> > > > > > mapper tasks, which means that you are usually bounded to a few hundred
> > > > > > thousand features, maybe up as high as a million, currently). So don't
> > > > > > stem, and throw in commonly occurring (or more importantly: high
> > > > > > log-likelihood) bigrams and trigrams as independent features.
> > > > > >
> > > > > > 3) violate the underlying assumption of LDA, that you're talking about
> > > > > > "token occurrences", and weight your vectors not as "tf", but "tf*idf",
> > > > > > which makes rarer features more prominent, which ends up making your
> > > > > > topics look a lot nicer.
> > > > > >
> > > > > > Those are the main tricks I can think of right now.
> > > > > >
> > > > > > If you're using Mahout trunk, try the new LDA impl:
> > > > > >
> > > > > > $MAHOUT_HOME/bin/mahout cvb0 --help
> > > > > >
> > > > > > It operates on the same kind of input as the last one (ie. a corpus
> > > > > > which is a SequenceFile<IntWritable, VectorWritable>).
> > > > > >
> > > > > > -jake
> > > > > >
> > > > > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:
> > > > > >
> > > > > > > I'm trying to find out if there are any standard best practices for
> > > > > > > document tokenization when prepping your data for LDA in order to get
> > > > > > > a higher quality topic model, and to understand how the feature space
> > > > > > > affects topic model quality.
> > > > > > >
> > > > > > > For example, will the topic model be "better" if there is a more rich
> > > > > > > feature space by not stemming terms, or is it better to have a more
> > > > > > > normalized feature space by applying stemming?
> > > > > > >
> > > > > > > Is it better to filter out stop words, or keep them in?
> > > > > > >
> > > > > > > Is it better to include bi and/or tri grams of highly correlated terms
> > > > > > > in the feature space?
> > > > > > >
> > > > > > > In essence what characteristics of the feature space that LDA uses for
> > > > > > > input will create a higher quality topic model.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JohnC
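For anyone digging this thread up later, here is how Jake's three tricks map onto a single seq2sparse run, plus the extra hop to the input format cvb0 wants. Treat it as a sketch, not gospel: the option names are what I remember from trunk (-x = --maxDFPercent, -ng = --maxNGramSize, -ml = --minLLR, -wt = --weight), the paths are placeholders, and everything should be double-checked against --help on your own checkout.

  WORK=/user/me/lda-demo   # placeholder working directory

  # tricks 1-3: don't stem (the stock analyzer doesn't), prune terms appearing
  # in more than ~50% of documents, emit bigrams/trigrams and keep only the
  # high log-likelihood ones, and weight as tf*idf
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i $WORK/seqfiles \
    -o $WORK/vectors \
    -wt tfidf \
    -x 50 \
    -ng 3 \
    -ml 50

  # cvb0 reads SequenceFile<IntWritable, VectorWritable>, while seq2sparse keys
  # its vectors by Text, so re-key them to integer row ids first
  $MAHOUT_HOME/bin/mahout rowid \
    -i $WORK/vectors/tfidf-vectors \
    -o $WORK/matrix

  # then point the new LDA job at $WORK/matrix/matrix; topic count, dictionary,
  # iteration count etc. are all listed by
  $MAHOUT_HOME/bin/mahout cvb0 --help

Keep in mind that trick 3 (tf*idf) is exactly the kind of re-weighting the reply at the top of this message argues against: it makes the topics look nicer, but LDA is no longer seeing true counts, so pick whichever trade-off matters more for your corpus.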
