Hi guys, I'm jumping back in here because the later emails have moved on to possible extensions (all of which sound great), and I wanted to tie this back to the original question.
This adjustment allowed me to create the vectors, build the LDA input, and pull the topics out of the final results.

I'm curious whether anyone has done any testing with the parameters. Obviously different data will call for different parameters (number of topics, smoothing, iterations, etc.), but I'm wondering particularly about stop words. I believe I ran across some older questions on the mailing list about this, where users were curious whether stop words could be specified in Mahout, or whether we should handle them during Lucene index creation, or somewhere else.

Another thought: we have the dictionary output. If we were to modify the dictionary to remove those stop words, would that have a similar effect, or does the algorithm require every word in a vector to exist in the dictionary? (I haven't had a chance to dig into it yet, so I apologize if this is obvious.)

Thanks for all the help. I'm excited this thread has gathered some steam within the community to improve the algorithm(s) surrounding LDA, as we (GE) feel this library has great potential.

Best,
Chris

bin/mahout lda -i /user/TopicTrending/ -o /user/TopicTrending/lda_output/ -k 5 -v 50000

On Tue, May 3, 2011 at 12:22 PM, Jake Mannix <[email protected]> wrote:
> Hi Chris,
>
> That's what I thought.
> This line needs to make sure you store termvectors
> (see this article
> <http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/>
> for more details):
>
> On Tue, May 3, 2011 at 8:32 AM, Chris McConnell <[email protected]> wrote:
>>
>> if (elementName.equals("doc")) {
>>   if (title && content) {
>>     doc.add(new Field("title", titleStr, Field.Store.YES, Field.Index.ANALYZED));
>>     doc.add(new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED));
>
> You want this to be:
>
> new Field("content", contentStr, Field.Store.YES, Field.Index.ANALYZED,
>     Field.TermVector.YES);
>
> Although technically, we could add the capability to take a Store.YES field
> and re-tokenize and build vectors from this as well.
>
> -jake
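
On the stop-word question, here is a rough sketch of the "filter before the vectors are built" option. It is plain Java using only the standard library and is not the Mahout or Lucene API; the class name, method names, and the tiny stop-word list are all hypothetical and only there for illustration. The point it tries to show is that when stop words are dropped before the dictionary is built, the dictionary and the term-count vectors stay consistent with each other. It says nothing about how Mahout's LDA treats a dictionary that was edited after the fact, which is still an open question here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: plain Java, not the Mahout or Lucene API.
class StopWordSketch {

    // Hypothetical stop-word list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is"));

    // Lower-cases tokens and drops stop words before any counting happens.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    // Builds a term -> index dictionary from already-filtered documents,
    // assigning indices in first-seen order.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                if (!dict.containsKey(term)) {
                    dict.put(term, dict.size());
                }
            }
        }
        return dict;
    }

    // Turns one document into a dense term-count vector over the dictionary.
    // Terms that are not in the dictionary are skipped (a choice made for
    // this sketch, not a statement about Mahout's behavior).
    static int[] toCountVector(List<String> doc, Map<String, Integer> dict) {
        int[] counts = new int[dict.size()];
        for (String term : doc) {
            Integer index = dict.get(term);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The", "topic", "of", "the", "model", "is", "a", "topic");
        List<String> filtered = removeStopWords(raw);

        List<List<String>> corpus = new ArrayList<>();
        corpus.add(filtered);

        Map<String, Integer> dict = buildDictionary(corpus);
        System.out.println("dictionary: " + dict);
        System.out.println("vector:     " + Arrays.toString(toCountVector(filtered, dict)));
    }
}
```

With Lucene, the equivalent step would presumably live in the analyzer used at index time, so that stop words never reach the term vectors in the first place.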
