Hi Chris,

I had a similar problem to the one you describe. It turned out that many of the words I wanted to "stop" are also words with high document frequency. One option for excluding these words is maxDFPercent, but there are two issues with it:

1. You need to know exactly what percentage to select.
2. It works only on the tfidf vectors and not on the tf ones (LDA uses the latter).
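
To make the document-frequency idea concrete, here is a minimal sketch in plain Java of cutting high-DF terms before the raw tf counts are built. It is not Mahout code and not what any Mahout patch does: the class DfStopwordFilter, its two methods, the sample documents and the 70% threshold are all made up for illustration, and the input is assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Mahout): drops terms with high document
// frequency before raw term-frequency (tf) counts are built.
final class DfStopwordFilter {

    // Returns the terms that occur in more than maxDfPercent of the documents.
    static Set<String> highDfTerms(List<List<String>> docs, double maxDfPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> stop = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double percent = 100.0 * e.getValue() / docs.size();
            if (percent > maxDfPercent) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    // Builds per-document tf counts, skipping the given stop terms.
    static List<Map<String, Integer>> termFrequencies(List<List<String>> docs,
                                                      Set<String> stop) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                if (!stop.contains(term)) {
                    tf.merge(term, 1, Integer::sum);
                }
            }
            result.add(tf);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up, pre-tokenized sample documents.
        List<List<String>> docs = List.of(
            List.of("the", "engine", "the", "turbine"),
            List.of("the", "sensor", "data"),
            List.of("the", "turbine", "data"));
        Set<String> stop = highDfTerms(docs, 70.0);
        System.out.println("stop terms: " + stop);
        System.out.println("tf counts: " + termFrequencies(docs, stop));
    }
}
```

In this toy corpus only "the" appears in more than 70% of the documents, so it is the only term removed from the tf counts. The first issue above still applies: the threshold has to be picked by hand.
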

You can take a look at https://issues.apache.org/jira/browse/MAHOUT-688 which provides one possible solution.

On Thu, May 5, 2011 at 4:27 PM, Chris McConnell <[email protected]> wrote:

> Hi guys,
>
> I'm jumping back as the later emails jump into expansions (all of
> which sound great), but I wanted to give this a better link back to
> the original question.
>
> This adjustment allowed me to get the vectors created, create the lda
> input and grab the topics out of the final results.
>
> I'm curious if anyone has done testing with the parameters at all.
> Obviously different data will lead to different parameter needs
> (number of topics, smoothing, iterations, etc.) but I'm wondering
> particularly about "stop words." I believe I ran across some older
> questions in the mailing list about this, where users were curious if
> they could be specified in Mahout, or if we should be doing so within
> the Lucene index creation, others?
>
> Another thought I had, we have the dictionary output, if we were to
> modify the dictionary to remove those stop words, would that have a
> similar effect, or does the algorithm (haven't had a chance to dig
> into it yet, so I apologize if this is obvious) require every word
> within the vector to exist in the dictionary?
>
> Thanks for all the help, I'm excited this chain has gathered some
> steam within the community to improve the algorithm(s) surrounding
> LDA, as we (GE) feel this library has great potential.
>
> Best,
> Chris
>
> bin/mahout lda -i /user/TopicTrending/ -o /user/TopicTrending/lda_output/ -k 5 -v 50000
>
> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <[email protected]> wrote:
> > Hi Chris,
> >
> > That's what I thought. This line needs to make sure you store termvectors
> > (see this article
> > <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/>
> > for more details):
> >
> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell <[email protected]> wrote:
> >>
> >> if (elementName.equals("doc")) {
> >>     if(title && content){
> >>         doc.add(new Field("title",titleStr,Field.Store.YES,Field.Index.ANALYZED));
> >>         doc.add(new Field("content",contentStr,Field.Store.YES,Field.Index.ANALYZED));
> >
> > You want this to be:
> >
> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
> >           Field.TermVector.YES);
> >
> > Although technically, we could add the capability to take a Store.YES field
> > and re-tokenize and build vectors from this as well.
> >
> > -jake
