On Thu, Jan 26, 2012 at 4:37 PM, John Conwell <[email protected]> wrote:
> One more question.  What about vector normalization when you vectorize your
> documents?  Would this help with topic model quality?

No, unless you have reason to feel that document length is definitely *not*
an indicator of how much topical information is being provided.  So if
you're building topic models off of webpages, and a page has only 20 words
on it, do you *want* it to have the same impact on the overall topic model
as a big page with 2000 words on it?  Maybe you do, if you've got a good
reason, but I can't think of a domain-independent reason to do that.

> > On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:
> >
> > > Thanks for all the feedback!  You've been a big help.
> > >
> > > On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
> > >
> >> On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
> >>
> >> > Hey Jake,
> >> > Thanks for the tips.  That will definitely help.
> >> >
> >> > One more question, do you know if the topic model quality will be
> >> > affected by the document length?
> >>
> >> Yes, very much so.
> >>
> >> > I'm thinking lengths ranging from tweets (~20 words),
> >>
> >> Tweets suck.  Trust me on this. ;)
> >>
> >> > to emails (hundreds of words),
> >>
> >> Fantastic size.
> >>
> >> > to whitepapers (thousands of words)
> >>
> >> Can be pretty great too.
> >>
> >> > to books (boat loads of words).
> >>
> >> This is too long.  There will be tons and tons of topics in a book,
> >> often.  But, frankly, I have not tried with huge documents personally,
> >> so I can't say from experience that it won't work.  I'd just not be
> >> terribly surprised if it didn't work well at all.  If I had a bunch of
> >> books I wanted to run LDA on, I'd maybe treat each page or each chapter
> >> as a separate document.
> >>
> >>   -jake
> >>
> >> > What lengths'ish would degrade topic model quality?
> >> >
> >> > I would think tweets would kind'a suck, but what about longer docs?
> >> > Should they be segmented into sub-documents?
> >> >
> >> > Thanks,
> >> > JohnC
> >> >
> >> > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi John,
> >> > >
> >> > > I'm not an expert in the field, but I have done a bit of work
> >> > > building topic models with LDA, and here are some of the "tricks"
> >> > > I've used:
> >> > >
> >> > > 1) Yes, remove stop words; in fact, remove all words occurring in
> >> > > more than (say) half (or, more conservatively, 90%) of your
> >> > > documents, as they'll be noise and just dominate your topics.
> >> > >
> >> > > 2) More features is better, if you have the memory for it (note
> >> > > that Mahout's LDA currently holds numTopics * numFeatures in memory
> >> > > in the mapper tasks, which means that you are usually bounded to a
> >> > > few hundred thousand features, maybe up as high as a million,
> >> > > currently).  So don't stem, and throw in commonly occurring (or,
> >> > > more importantly, high log-likelihood) bigrams and trigrams as
> >> > > independent features.
> >> > >
> >> > > 3) Violate the underlying assumption of LDA, that you're talking
> >> > > about "token occurrences", and weight your vectors not as "tf" but
> >> > > as "tf*idf", which makes rarer features more prominent, which ends
> >> > > up making your topics look a lot nicer.
> >> > >
> >> > > Those are the main tricks I can think of right now.
> >> > >
> >> > > If you're using Mahout trunk, try the new LDA impl:
> >> > >
> >> > >   $MAHOUT_HOME/bin/mahout cvb0 --help
> >> > >
> >> > > It operates on the same kind of input as the last one (i.e. a corpus
> >> > > which is a SequenceFile<IntWritable, VectorWritable>).
> >> > >
> >> > >   -jake
> >> > >
> >> > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > I'm trying to find out if there are any standard best practices
> >> > > > for document tokenization when prepping your data for LDA in
> >> > > > order to get a higher quality topic model, and to understand how
> >> > > > the feature space affects topic model quality.
> >> > > >
> >> > > > For example, will the topic model be "better" if there is a
> >> > > > richer feature space by not stemming terms, or is it better to
> >> > > > have a more normalized feature space by applying stemming?
> >> > > >
> >> > > > Is it better to filter out stop words, or keep them in?
> >> > > >
> >> > > > Is it better to include bi- and/or tri-grams of highly correlated
> >> > > > terms in the feature space?
> >> > > >
> >> > > > In essence, what characteristics of the feature space that LDA
> >> > > > uses for input will create a higher quality topic model?
> >> > > >
> >> > > > Thanks,
> >> > > > JohnC
> >> >
> >> > --
> >> >
> >> > Thanks,
> >> > John C
> >
> > --
> >
> > Thanks,
> > John C
>
> --
>
> Thanks,
> John C
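As a rough sketch of how tricks (1)-(3) above map onto Mahout's seq2sparse
vectorizer: assuming the raw corpus has already been converted to a
SequenceFile<Text, Text> under a hypothetical corpus-seq directory, something
along these lines should produce tf-idf vectors with no stemming, n-grams up
to trigrams, and high-document-frequency terms pruned (the directory names
and threshold values here are illustrative, not taken from the thread):

  # Tokenize with the default (non-stemming) Lucene analyzer, keep bigrams
  # and trigrams that pass a log-likelihood ratio test, drop any term that
  # appears in more than half of the documents, and weight by tf*idf.
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i corpus-seq \
    -o corpus-vectors \
    --weight tfidf \
    --maxDFPercent 50 \
    --maxNGramSize 3 \
    --minLLR 50 \
    --namedVector

Leaving the --norm option unset skips length normalization of the document
vectors, which matches the webpage-length reasoning at the top of the thread;
set it only if short and long documents really should carry equal weight.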

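And since seq2sparse keys its vectors by document name (Text), one way to get
the SequenceFile<IntWritable, VectorWritable> that cvb0 expects is Mahout's
rowid job before running the new LDA impl; the paths continue the
hypothetical corpus-vectors layout above, the topic and iteration counts are
arbitrary, and the flag names are from memory, so check cvb0 --help for the
authoritative list:

  # Re-key the tf-idf vectors from Text document names to sequential ints;
  # rowid writes a 'matrix' (IntWritable -> VectorWritable) plus a 'docIndex'
  # mapping those ints back to the original document names.
  $MAHOUT_HOME/bin/mahout rowid \
    -i corpus-vectors/tfidf-vectors \
    -o corpus-vectors/matrix

  # Train 20 topics for 20 iterations, writing topic-term distributions to
  # lda-topics and document-topic distributions to lda-doc-topics.
  $MAHOUT_HOME/bin/mahout cvb0 \
    -i corpus-vectors/matrix/matrix \
    -dict corpus-vectors/dictionary.file-0 \
    -o lda-topics \
    -dt lda-doc-topics \
    -k 20 \
    -x 20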