Very interesting, thanks for the pointer Vasil! We've been experimenting a bit, including removing some of the stop words during the Lucene index creation, which is helping as well.
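For reference, the indexing-side change is roughly along the lines of the sketch below. This is only an illustration, not our actual code: the extra stop words, field values and index path are placeholders, and it assumes Lucene 3.1 with the stock StandardAnalyzer.

    import java.io.File;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class StopWordIndexer {
      public static void main(String[] args) throws Exception {
        // Stop set applied at index time: a few common English words plus
        // some domain-specific terms (both lists abridged placeholders).
        Set<String> stopWords = new HashSet<String>(Arrays.asList(
            "the", "and", "of", "to",
            "report", "section"));

        // StandardAnalyzer drops these terms while tokenizing.
        StandardAnalyzer analyzer =
            new StandardAnalyzer(Version.LUCENE_31, stopWords);

        Directory dir = FSDirectory.open(new File("/tmp/topic-index")); // placeholder path
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_31, analyzer));

        Document doc = new Document();
        doc.add(new Field("title", "Example title",
            Field.Store.YES, Field.Index.ANALYZED));
        // Keep TermVector.YES on the content field (per Jake's note below),
        // otherwise the Lucene-to-Mahout vector extraction has nothing to read.
        doc.add(new Field("content", "Example content for the LDA run",
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
      }
    }
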
Thanks again for this link.

Best,
Chris

On Wed, May 11, 2011 at 11:10 AM, Vasil Vasilev <[email protected]> wrote:
> Hi Chris,
>
> I had a similar problem to what you describe. It turned out that many of
> the words I wanted to "stop" are also words with high document frequency.
> In order to avoid these words one option is to use maxDFPercent, but there
> are two issues with this:
> 1. You should know exactly what percentage to select
> 2. It works only on the tfidf vectors and not on the tf ones (LDA uses the
> latter)
>
> You can take a look at https://issues.apache.org/jira/browse/MAHOUT-688
> which provides one possible solution.
>
> On Thu, May 5, 2011 at 4:27 PM, Chris McConnell <[email protected]> wrote:
>
>> Hi guys,
>>
>> I'm jumping back in as the later emails jump into expansions (all of
>> which sound great), but I wanted to give this a better link back to
>> the original question.
>>
>> This adjustment allowed me to get the vectors created, create the lda
>> input and grab the topics out of the final results.
>>
>> I'm curious if anyone has done testing with the parameters at all.
>> Obviously different data will lead to different parameter needs
>> (number of topics, smoothing, iterations, etc.), but I'm wondering
>> particularly about "stop words." I believe I ran across some older
>> questions in the mailing list about this, where users were curious if
>> they could be specified in Mahout, or if we should be doing so within
>> the Lucene index creation, or elsewhere?
>>
>> Another thought I had: we have the dictionary output, so if we were to
>> modify the dictionary to remove those stop words, would that have a
>> similar effect, or does the algorithm (haven't had a chance to dig
>> into it yet, so I apologize if this is obvious) require every word
>> within the vector to exist in the dictionary?
>>
>> Thanks for all the help. I'm excited this chain has gathered some
>> steam within the community to improve the algorithm(s) surrounding
>> LDA, as we (GE) feel this library has great potential.
>>
>> Best,
>> Chris
>>
>> bin/mahout lda -i /user/TopicTrending/ -o
>> /user/TopicTrending/lda_output/ -k 5 -v 50000
>>
>> On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <[email protected]> wrote:
>> > Hi Chris,
>> >
>> > That's what I thought. This line needs to make sure you store term
>> > vectors (see this article for more details:
>> > http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/):
>> >
>> > On Tue, May 3, 2011 at 8:32 AM, Chris McConnell <[email protected]> wrote:
>> >>
>> >> if (elementName.equals("doc")) {
>> >>     if (title && content) {
>> >>         doc.add(new
>> >> Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>> >>         doc.add(new
>> >> Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
>> >
>> >
>> > You want this to be:
>> >
>> > new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
>> > Field.TermVector.YES);
>> >
>> > Although technically, we could add the capability to take a Store.YES
>> > field and re-tokenize and build vectors from this as well.
>> >
>> > -jake
>> >
>>
