Very interesting, thanks for the pointer, Vasil! We've been experimenting
a bit, including removing some of the stop words during Lucene index
creation, which is helping as well.
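
The idea of dropping stop words before the text is indexed can be
illustrated with a minimal sketch in plain Java. It deliberately does not
use Lucene's analyzer classes, and the stop list, the class name
StopWordFilter, and the sample sentence are all assumptions made up for
this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop list for illustration only; a real list would be
    // larger and tuned to the corpus.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in", "is"));

    // Lowercases the text, splits on runs of non-letter characters,
    // and keeps only the tokens that are not in the stop list.
    static List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The engine of the turbine is in service"));
    }
}
```

The tokens that survive are what would be handed to the indexer, so the
stop words never reach the term vectors or the dictionary.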

Thanks again for this link.
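
For anyone following along, the document-frequency cutoff discussed in the
quoted message below can be sketched roughly as follows. This is plain
Java, not Mahout's code and not what MAHOUT-688 implements; the class name
DocFrequencyPruner, the method names, and the threshold parameter are
assumptions for illustration. Because the pruning happens on the raw token
lists, before any weighting, the same filtered documents could feed plain
tf counts as well as tfidf.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DocFrequencyPruner {
    // Returns the terms that occur in more than maxDfPercent percent of the documents.
    static Set<String> highDfTerms(List<List<String>> docs, double maxDfPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Use a set so each term is counted at most once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> tooCommon = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double percent = 100.0 * e.getValue() / docs.size();
            if (percent > maxDfPercent) {
                tooCommon.add(e.getKey());
            }
        }
        return tooCommon;
    }

    // Returns copies of the documents with the given terms removed.
    static List<List<String>> prune(List<List<String>> docs, Set<String> drop) {
        List<List<String>> pruned = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String term : doc) {
                if (!drop.contains(term)) {
                    kept.add(term);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }
}
```

Picking the threshold is still a judgment call, which is the first of the
two issues raised below; the sketch only shows the mechanics.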

Best,
Chris

On Wed, May 11, 2011 at 11:10 AM, Vasil Vasilev <[email protected]> wrote:
> Hi Chris,
>
> I had a similar problem to what you describe. It turned out that many of the
> words I wanted to "stop" are also words with high document frequency.
> In order to avoid these words one option is to use maxDFPercent, but there
> are two issues with this:
> 1. You need to know exactly what percentage to select
> 2. It works only on the tfidf vectors and not on the tf ones (LDA uses the
> latter)
>
> You can take a look at
> https://issues.apache.org/jira/browse/MAHOUT-688 which provides one
> possible solution.
>
> On Thu, May 5, 2011 at 4:27 PM, Chris McConnell
> <[email protected]> wrote:
>
>> Hi guys,
>>
>> I'm jumping back as the later emails jump into expansions (all of
>> which sound great), but I wanted to give this a better link back to
>> the original question.
>>
>> This adjustment allowed me to get the vectors created, create the lda
>> input and grab the topics out of the final results.
>>
>> I'm curious if anyone has done testing with the parameters at all.
>> Obviously different data will lead to different parameter needs
>> (number of topics, smoothing, iterations, etc.) but I'm wondering
>> particularly about "stop words." I believe I ran across some older
>> questions on the mailing list about this, where users were curious whether
>> they could be specified in Mahout, or whether we should handle them during
>> Lucene index creation, or somewhere else.
>>
>> Another thought: we have the dictionary output. If we were to
>> modify the dictionary to remove those stop words, would that have a
>> similar effect, or does the algorithm (I haven't had a chance to dig
>> into it yet, so I apologize if this is obvious) require every word
>> within the vector to exist in the dictionary?
>>
>> Thanks for all the help, I'm excited this chain has gathered some
>> steam within the community to improve the algorithm(s) surrounding
>> LDA, as we (GE) feel this library has great potential.
>>
>> Best,
>> Chris
>>
>> bin/mahout lda -i /user/TopicTrending/ -o
>> /user/TopicTrending/lda_output/ -k 5 -v 50000
>>
>> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <[email protected]>
>> wrote:
>> > Hi Chris,
>> >
>> >  That's what I thought.  This line needs to make sure you store term vectors
>> > (see this article
>> > <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/>
>> > for more details):
>> >
>> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell
>> > <[email protected]> wrote:
>> >>
>> >> if (elementName.equals("doc")) {
>> >>     if (title && content) {
>> >>         doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>> >>         doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
>> >
>> >
>> > You want this to be:
>> >
>> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
>> > Field.TermVector.YES);
>> >
>> > Although technically, we could add the capability to take a Store.YES field
>> > and re-tokenize and build vectors from this as well.
>> >
>> >  -jake
>> >
>>
>
