On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
> Hey Jake,
> Thanks for the tips.  That will definitely help.
>
> One more question, do you know if the topic model quality will be affected
> by the document length?

Yes, very much so.

> I'm thinking lengths ranging from tweets (~20 words),

Tweets suck.  Trust me on this. ;)

> to emails (hundreds of words),

Fantastic size.

> to whitepapers (thousands of words)

Can be pretty great too.

> to books (boat loads of words).

This is too long.  There will be tons and tons of topics in a book, often.
But, frankly, I have not tried with huge documents personally, so I can't
say from experience that it won't work.  I'd just not be terribly surprised
if it didn't work well at all.  If I had a bunch of books I wanted to run
LDA on, I'd maybe treat each page or each chapter as a separate document.

  -jake

> What lengths'ish would degrade topic model quality.
>
> I would think tweets would kind'a suck, but what about longer docs?  Should
> they be segmented into sub-documents?
>
> Thanks,
> JohnC
>
>
> On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]> wrote:
>
> > Hi John,
> >
> > I'm not an expert in the field, but I have done a bit of work building
> > topic models with LDA, and here are some of the "tricks" I've used:
> >
> > 1) yes remove stop words, in fact remove all words occurring in more than
> > (say) half (or more conservatively, 90%) of your documents, as they'll be
> > noise and just dominate your topics.
> >
> > 2) more features is better, if you have the memory for it (note that
> > mahout's LDA currently holds numTopics * numFeatures in memory in the
> > mapper tasks, which means that you are usually bounded to a few hundred
> > thousand features, maybe up as high as a million, currently).  So don't
> > stem, and throw in commonly occurring (or more importantly: high
> > log-likelihood) bigrams and trigrams as independent features.
> >
> > 3) violate the underlying assumption of LDA, that you're talking about
> > "token occurrences", and weight your vectors not as "tf", but "tf*idf",
> > which makes rarer features more prominent, which ends up making your
> > topics look a lot nicer.
> >
> > Those are the main tricks I can think of right now.
> >
> > If you're using Mahout trunk, try the new LDA impl:
> >
> >   $MAHOUT_HOME/bin/mahout cvb0 --help
> >
> > It operates on the same kind of input as the last one (ie. a corpus which
> > is a SequenceFile<IntWritable, VectorWritable>).
> >
> >   -jake
> >
> > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:
> >
> > > I'm trying to find out if there are any standard best practices for
> > > document tokenization when prepping your data for LDA in order to get a
> > > higher quality topic model, and to understand how the feature space
> > > affects topic model quality.
> > >
> > > For example, will the topic model be "better" if there is a more rich
> > > feature space by not stemming terms, or is it better to have a more
> > > normalized feature space by applying stemming?
> > >
> > > Is it better to filter out stop words, or keep them in?
> > >
> > > Is it better to include bi and/or tri grams of highly correlated terms in
> > > the feature space?
> > >
> > > In essence what characteristics of the feature space that LDA uses for
> > > input will create a higher quality topic model.
> > >
> > > Thanks,
> > > JohnC
>
> --
> Thanks,
> John C
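
A minimal sketch of the per-chapter splitting suggested above, assuming
plain-text books whose chapters start with a line beginning "CHAPTER"
(the file names, directories, and chapter marker are placeholders, not
anything from the thread):

  # Split one book into one file per chapter (GNU csplit); adjust the
  # /^CHAPTER/ pattern to whatever marks chapter boundaries in your books.
  mkdir -p chapters
  csplit --prefix=chapters/mobydick- --suffix-format='%03d.txt' \
      mobydick.txt '/^CHAPTER/' '{*}'

  # Turn the directory of chapter files into a SequenceFile<Text,Text>
  # corpus that the Mahout vectorization jobs can read.
  $MAHOUT_HOME/bin/mahout seqdirectory -i chapters -o corpus-seqfiles -c UTF-8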
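
A sketch of vectorization flags that line up with tricks 1-3 above
(document-frequency cutoff, high log-likelihood bigrams/trigrams, no
stemming, tf*idf weighting).  The flag names are from the seq2sparse job of
that era of Mahout, so double-check them against seq2sparse --help on your
version; the cutoff values are only illustrative.  This is also where you
keep the vocabulary inside the bound mentioned in point 2: for example,
200 topics x 500,000 features of 8-byte doubles is already roughly 800 MB
per mapper, assuming a dense double matrix.

  # -wt tfidf    : trick 3, weight vectors as tf*idf rather than raw tf
  # -x 50        : trick 1, drop any term occurring in more than 50% of docs
  # -ng 3 -ml 50 : trick 2, add bigrams and trigrams, keeping only those
  #                with a high log-likelihood ratio
  # (no stemming analyzer is configured, so tokens stay unstemmed)
  $MAHOUT_HOME/bin/mahout seq2sparse \
      -i corpus-seqfiles -o corpus-vectors \
      -wt tfidf -x 50 -ng 3 -ml 50 -nv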
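
Finally, a sketch of feeding the result to the new implementation: cvb0
expects a SequenceFile<IntWritable, VectorWritable>, while seq2sparse
writes Text-keyed vectors, and the rowid job is one way to re-key them.
The cvb0 flags below are a best-effort reading of trunk around that time
(topic count and iteration count are placeholders); treat cvb0 --help as
the authority.

  # Re-key the Text-keyed tf-idf vectors to sequential IntWritable row ids;
  # this writes a "matrix" part and a "docIndex" mapping back to doc names.
  $MAHOUT_HOME/bin/mahout rowid \
      -i corpus-vectors/tfidf-vectors \
      -o corpus-vectors/matrix

  # Train the topic model on the re-keyed corpus.
  $MAHOUT_HOME/bin/mahout cvb0 \
      -i corpus-vectors/matrix/matrix \
      -o lda-topic-term \
      -dt lda-doc-topics \
      -dict corpus-vectors/dictionary.file-0 \
      -k 100 -x 20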
