One more question: what about vector normalization when you vectorize your documents? Would this help with topic model quality?

On Tue, Jan 24, 2012 at 4:11 PM, John Conwell <[email protected]> wrote:

> Thanks for all the feedback! You've been a big help.
>
>
> On Tue, Jan 24, 2012 at 4:04 PM, Jake Mannix <[email protected]> wrote:
>
>> On Tue, Jan 24, 2012 at 3:41 PM, John Conwell <[email protected]> wrote:
>>
>> > Hey Jake,
>> > Thanks for the tips. That will definitely help.
>> >
>> > One more question: do you know if the topic model quality will be
>> > affected by the document length?
>>
>> Yes, very much so.
>>
>> > I'm thinking lengths ranging from tweets (~20 words),
>>
>> Tweets suck. Trust me on this. ;)
>>
>> > to emails (hundreds of words),
>>
>> Fantastic size.
>>
>> > to whitepapers (thousands of words)
>>
>> Can be pretty great too.
>>
>> > to books (boat loads of words).
>>
>> This is too long. There will often be tons and tons of topics in a book.
>> But, frankly, I have not tried with huge documents personally, so I can't
>> say from experience that it won't work. I'd just not be terribly
>> surprised if it didn't work well at all. If I had a bunch of books I
>> wanted to run LDA on, I'd maybe treat each page or each chapter as a
>> separate document.
>>
>> -jake
>>
>> > What lengths'ish would degrade topic model quality?
>> >
>> > I would think tweets would kind'a suck, but what about longer docs?
>> > Should they be segmented into sub-documents?
>> >
>> > Thanks,
>> > JohnC
>> >
>> > On Tue, Jan 24, 2012 at 12:33 PM, Jake Mannix <[email protected]> wrote:
>> >
>> > > Hi John,
>> > >
>> > > I'm not an expert in the field, but I have done a bit of work
>> > > building topic models with LDA, and here are some of the "tricks"
>> > > I've used:
>> > >
>> > > 1) Yes, remove stop words; in fact, remove all words occurring in
>> > > more than (say) half (or, more conservatively, 90%) of your
>> > > documents, as they'll be noise and will just dominate your topics.
>> > >
>> > > 2) More features is better, if you have the memory for it (note that
>> > > Mahout's LDA currently holds numTopics * numFeatures in memory in the
>> > > mapper tasks, which means that you are usually bounded to a few
>> > > hundred thousand features, maybe up as high as a million, currently).
>> > > So don't stem, and throw in commonly occurring (or, more importantly,
>> > > high log-likelihood) bigrams and trigrams as independent features.
>> > >
>> > > 3) Violate the underlying assumption of LDA, that you're talking
>> > > about "token occurrences", and weight your vectors not as "tf" but as
>> > > "tf*idf", which makes rarer features more prominent and ends up
>> > > making your topics look a lot nicer.
>> > >
>> > > Those are the main tricks I can think of right now.
>> > >
>> > > If you're using Mahout trunk, try the new LDA impl:
>> > >
>> > > $MAHOUT_HOME/bin/mahout cvb0 --help
>> > >
>> > > It operates on the same kind of input as the last one (i.e. a corpus
>> > > which is a SequenceFile<IntWritable, VectorWritable>).
>> > >
>> > > -jake
>> > >
>> > > On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:
>> > >
>> > > > I'm trying to find out if there are any standard best practices
>> > > > for document tokenization when prepping your data for LDA in order
>> > > > to get a higher-quality topic model, and to understand how the
>> > > > feature space affects topic model quality.
>> > > >
>> > > > For example, will the topic model be "better" if there is a richer
>> > > > feature space from not stemming terms, or is it better to have a
>> > > > more normalized feature space by applying stemming?
>> > > >
>> > > > Is it better to filter out stop words, or keep them in?
>> > > >
>> > > > Is it better to include bi- and/or tri-grams of highly correlated
>> > > > terms in the feature space?
>> > > >
>> > > > In essence, what characteristics of the feature space that LDA
>> > > > uses for input will create a higher-quality topic model?
>> > > >
>> > > > Thanks,
>> > > > JohnC
>> >
>> > --
>> > Thanks,
>> > John C
>
> --
> Thanks,
> John C

--
Thanks,
John C
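
On Jake's suggestion to treat each page or chapter of a book as its own
document: one cheap way to do the split before vectorizing, assuming
plain-text books whose chapters start with a recognizable heading (the
"^Chapter " pattern below is purely an assumption about the format), is
GNU csplit:

    # Split book.txt into chapter_000.txt, chapter_001.txt, ... at every
    # line beginning "Chapter "; the heading regex is an assumption about
    # how your particular books are formatted.
    csplit -f chapter_ -b '%03d.txt' book.txt '/^Chapter /' '{*}'

The resulting per-chapter files can then be turned into the
SequenceFile<Text, Text> corpus Mahout's vectorizer expects, e.g. with
mahout seqdirectory -i chapters/ -o corpus-seqfiles.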
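Jake's three tricks all map onto flags of Mahout's seq2sparse vectorizer.
A minimal sketch, assuming a corpus already in SequenceFile<Text, Text>
form under corpus-seqfiles/; the paths and thresholds are illustrative
only, and flag names should be checked against seq2sparse --help for your
build:

    # Trick 1: --maxDFPercent 50 drops any term appearing in over half the
    #          documents (90 would be the more conservative cut).
    # Trick 2: --maxNGramSize 3 adds bigrams and trigrams as features, and
    #          --minLLR keeps only the high log-likelihood-ratio ones; the
    #          default analyzer does no stemming, which also fits trick 2.
    # Trick 3: --weight tfidf weights vectors as tf*idf rather than raw tf.
    $MAHOUT_HOME/bin/mahout seq2sparse \
      -i corpus-seqfiles \
      -o corpus-vectors \
      --maxDFPercent 50 \
      --maxNGramSize 3 \
      --minLLR 100 \
      --weight tfidf \
      --namedVector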
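As Jake notes, cvb0 wants a SequenceFile<IntWritable, VectorWritable>,
whereas seq2sparse keys its vectors by document name (Text). Mahout's
rowid job does that re-keying; the cvb0 flags shown after it are
assumptions from memory, so do run cvb0 --help as suggested in the thread
before relying on them:

    # Re-key the tf-idf vectors from Text to IntWritable; rowid writes a
    # "matrix" and a "docIndex" part under the output directory.
    $MAHOUT_HOME/bin/mahout rowid \
      -i corpus-vectors/tfidf-vectors \
      -o corpus-matrix

    # Illustrative invocation only -- verify every flag with cvb0 --help.
    $MAHOUT_HOME/bin/mahout cvb0 \
      -i corpus-matrix/matrix \
      -o lda-topics \
      -k 20 \
      -dict corpus-vectors/dictionary.file-0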
