One more question: what about vector normalization when you vectorize your documents? Would this help with topic model quality?

On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:

> Thanks for all the feedback! You've been a big help.
>
>
> On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
>
>> On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
>>
>> > Hey Jake,
>> > Thanks for the tips. That will definitely help.
>> >
>> > One more question: do you know if the topic model quality will be
>> > affected by the document length?
>>
>> Yes, very much so.
>>
>> > I'm thinking lengths ranging from tweets (~20 words),
>>
>> Tweets suck. Trust me on this. ;)
>>
>> > to emails (hundreds of words),
>>
>> Fantastic size.
>>
>> > to whitepapers (thousands of words)
>>
>> Can be pretty great too.
>>
>> > to books (boat loads of words).
>>
>> This is too long. There will often be tons and tons of topics in a book.
>> But, frankly, I have not tried with huge documents personally, so I can't
>> say from experience that it won't work. I'd just not be terribly
>> surprised if it didn't work well at all. If I had a bunch of books I
>> wanted to run LDA on, I'd maybe treat each page or each chapter as a
>> separate document.
>>
>> -jake
>>
>> > What lengths'ish would degrade topic model quality?
>> >
>> > I would think tweets would kind'a suck, but what about longer docs?
>> > Should they be segmented into sub-documents?
>> >
>> > Thanks,
>> > JohnC
>> >
>> > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]> wrote:
>> >
>> > > Hi John,
>> > >
>> > > I'm not an expert in the field, but I have done a bit of work
>> > > building topic models with LDA, and here are some of the "tricks"
>> > > I've used:
>> > >
>> > > 1) Yes, remove stop words; in fact, remove all words occurring in
>> > > more than (say) half (or, more conservatively, 90%) of your
>> > > documents, as they'll be noise and will just dominate your topics.
>> > >
>> > > 2) More features is better, if you have the memory for it (note that
>> > > Mahout's LDA currently holds numTopics * numFeatures in memory in the
>> > > mapper tasks, which means that you are usually bounded to a few
>> > > hundred thousand features, maybe up as high as a million, currently).
>> > > So don't stem, and throw in commonly occurring (or, more importantly,
>> > > high log-likelihood) bigrams and trigrams as independent features.
>> > >
>> > > 3) Violate the underlying assumption of LDA, that you're talking
>> > > about "token occurrences", and weight your vectors not as "tf" but as
>> > > "tf*idf", which makes rarer features more prominent and ends up
>> > > making your topics look a lot nicer.
>> > >
>> > > Those are the main tricks I can think of right now.
>> > >
>> > > If you're using Mahout trunk, try the new LDA impl:
>> > >
>> > > $MAHOUT_HOME/bin/mahout cvb0 --help
>> > >
>> > > It operates on the same kind of input as the last one (i.e. a corpus
>> > > which is a SequenceFile<IntWritable, VectorWritable>).
>> > >
>> > > -jake
>> > >
>> > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:
>> > >
>> > > > I'm trying to find out if there are any standard best practices
>> > > > for document tokenization when prepping your data for LDA in order
>> > > > to get a higher-quality topic model, and to understand how the
>> > > > feature space affects topic model quality.
>> > > >
>> > > > For example, will the topic model be "better" if there is a richer
>> > > > feature space from not stemming terms, or is it better to have a
>> > > > more normalized feature space by applying stemming?
>> > > >
>> > > > Is it better to filter out stop words, or keep them in?
>> > > >
>> > > > Is it better to include bi- and/or tri-grams of highly correlated
>> > > > terms in the feature space?
>> > > >
>> > > > In essence, what characteristics of the feature space that LDA
>> > > > uses for input will create a higher-quality topic model?
>> > > >
>> > > > Thanks,
>> > > > JohnC
>> >
>> > --
>> > Thanks,
>> > John C
>
> --
> Thanks,
> John C

--
Thanks,
John C
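
On Jake's suggestion to treat each page or chapter of a book as its own
document: one cheap way to do the split before vectorizing, assuming
plain-text books whose chapters start with a recognizable heading (the
"^Chapter " pattern below is purely an assumption about the format), is
GNU csplit:

    # Split book.txt into chapter_000.txt, chapter_001.txt, ... at every
    # line beginning "Chapter "; the heading regex is an assumption about
    # how your particular books are formatted.
    csplit -f chapter_ -b '%03d.txt' book.txt '/^Chapter /' '{*}'

The resulting per-chapter files can then be turned into the
SequenceFile<Text, Text> corpus Mahout's vectorizer expects, e.g. with
mahout seqdirectory -i chapters/ -o corpus-seqfiles.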
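Jake's three tricks all map onto flags of Mahout's seq2sparse vectorizer.
A minimal sketch, assuming a corpus already in SequenceFile<Text, Text>
form under corpus-seqfiles/; the paths and thresholds are illustrative
only, and flag names should be checked against seq2sparse --help for your
build:

    # Trick 1: --maxDFPercent 50 drops any term appearing in over half the
    #          documents (90 would be the more conservative cut).
    # Trick 2: --maxNGramSize 3 adds bigrams and trigrams as features, and
    #          --minLLR keeps only the high log-likelihood-ratio ones; the
    #          default analyzer does no stemming, which also fits trick 2.
    # Trick 3: --weight tfidf weights vectors as tf*idf rather than raw tf.
    $MAHOUT_HOME/bin/mahout seq2sparse \
      -i corpus-seqfiles \
      -o corpus-vectors \
      --maxDFPercent 50 \
      --maxNGramSize 3 \
      --minLLR 100 \
      --weight tfidf \
      --namedVector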
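As Jake notes, cvb0 wants a SequenceFile<IntWritable, VectorWritable>,
whereas seq2sparse keys its vectors by document name (Text). Mahout's
rowid job does that re-keying; the cvb0 flags shown after it are
assumptions from memory, so do run cvb0 --help as suggested in the thread
before relying on them:

    # Re-key the tf-idf vectors from Text to IntWritable; rowid writes a
    # "matrix" and a "docIndex" part under the output directory.
    $MAHOUT_HOME/bin/mahout rowid \
      -i corpus-vectors/tfidf-vectors \
      -o corpus-matrix

    # Illustrative invocation only -- verify every flag with cvb0 --help.
    $MAHOUT_HOME/bin/mahout cvb0 \
      -i corpus-matrix/matrix \
      -o lda-topics \
      -k 20 \
      -dict corpus-vectors/dictionary.file-0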
