The patch for pruning words with high document frequencies is ready: https://issues.apache.org/jira/browse/MAHOUT-688
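For anyone who wants the gist without reading the patch: the pruner is a 3-sigma cutoff on the document frequencies. A minimal in-memory sketch of the criterion (class and variable names are illustrative, not the ones in the patch, which of course works over the seq2sparse output):

import java.util.HashMap;
import java.util.Map;

// Sketch of the MAHOUT-688 cutoff: drop every term whose document
// frequency exceeds 3x the standard deviation of all document frequencies.
public class DfPrunerSketch {

  // dfs maps termId -> number of documents the term occurs in
  public static Map<Integer, Long> prune(Map<Integer, Long> dfs) {
    double mean = 0.0;
    for (long df : dfs.values()) {
      mean += df;
    }
    mean /= dfs.size();

    double variance = 0.0;
    for (long df : dfs.values()) {
      variance += (df - mean) * (df - mean);
    }
    double stdDev = Math.sqrt(variance / dfs.size());

    // keep only terms at or below the 3-sigma cutoff
    Map<Integer, Long> kept = new HashMap<Integer, Long>();
    for (Map.Entry<Integer, Long> e : dfs.entrySet()) {
      if (e.getValue() <= 3.0 * stdDev) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}

Note the threshold is 3 times the standard deviation itself, not mean plus 3 sigma, as described in the original mail below.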
On Thu, Apr 28, 2011 at 5:08 PM, Vasil Vasilev <[email protected]> wrote:

> Also the topic regularization patch is ready:
> https://issues.apache.org/jira/browse/MAHOUT-684
>
>
> On Thu, Apr 28, 2011 at 10:53 AM, Vasil Vasilev <[email protected]> wrote:
>
>> Hi all,
>>
>> The LDA Vectorization patch is ready. You can take a look at:
>> https://issues.apache.org/jira/browse/MAHOUT-683
>>
>> Regards, Vasil
>>
>> On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <[email protected]> wrote:
>>
>>> Ok. I am going to try out 1) as suggested by Jake, then write a couple
>>> of tests, and then I will file the JIRAs.
>>>
>>>
>>> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>>
>>>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>>>
>>>> > Hi Mahouters,
>>>> >
>>>> > I was experimenting with the LDA clustering algorithm on the Reuters
>>>> > data set and I made several enhancements which, if you find them
>>>> > interesting, I could contribute to the project:
>>>> >
>>>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors,
>>>> > not the tf-idf ones that result from seq2sparse. Because of this,
>>>> > words like "and", "where", etc. also get included in the resulting
>>>> > topics. To prevent that, I run seq2sparse with the whole tf-idf
>>>> > sequence and then run the "pruner". It first calculates the standard
>>>> > deviation of the document frequencies of the words and then prunes
>>>> > all entries in the tf vectors whose document frequency is bigger
>>>> > than 3 times the calculated standard deviation. This keeps most of
>>>> > the word population while still pruning the unnecessary entries.
>>>> >
>>>> > 2. Implemented the alpha-estimation part of the LDA algorithm as
>>>> > described in the Blei, Ng, Jordan paper. This leads to better
>>>> > results in maximizing the log-likelihood for the same number of
>>>> > iterations. Just an example: for 20 iterations on the Reuters data
>>>> > set the enhanced algorithm reaches a value of -6975124.693072233,
>>>> > compared to -7304552.275676554 with the original implementation.
>>>> >
>>>> > 3. Created an LDA Vectorizer. It executes only the inference part of
>>>> > the LDA algorithm, based on the last LDA state and the input
>>>> > document vectors, and for each vector produces a vector of the
>>>> > gammas that result from the inference. The idea is that the vectors
>>>> > produced in this way can be used for clustering with any of the
>>>> > existing algorithms (like canopy, kmeans, etc.).
>>>> >
>>>>
>>>> As Jake says, this all sounds great. Please see:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>>>
>>>>
>>>
>>
>
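For reference, the alpha estimation in point 2 of the quoted mail is a Newton-Raphson update on a single symmetric alpha, done in log space so alpha stays positive, following Appendix A of the Blei, Ng, Jordan paper. A rough sketch (illustrative names, not the MAHOUT-684 code; digamma/trigamma taken from commons-math):

import org.apache.commons.math3.special.Gamma;

// Sketch of symmetric-alpha estimation by Newton-Raphson, after Blei,
// Ng & Jordan (2003), Appendix A. "ss" is the sufficient statistic
// accumulated in the E-step:
//   ss = sum over docs d and topics k of
//        (digamma(gamma_dk) - digamma(sum_j gamma_dj))
public class AlphaEstimatorSketch {

  public static double estimateAlpha(double ss, int numDocs, int numTopics) {
    double logAlpha = Math.log(0.1); // arbitrary starting point
    for (int iter = 0; iter < 100; iter++) {
      double alpha = Math.exp(logAlpha);
      // first and second derivatives of the variational bound w.r.t. alpha
      double d1 = numDocs * numTopics
          * (Gamma.digamma(numTopics * alpha) - Gamma.digamma(alpha)) + ss;
      double d2 = numDocs * numTopics
          * (numTopics * Gamma.trigamma(numTopics * alpha) - Gamma.trigamma(alpha));
      // Newton step taken in log space keeps alpha > 0
      logAlpha -= d1 / (d2 * alpha + d1);
      if (Math.abs(d1) < 1e-5) {
        break;
      }
    }
    return Math.exp(logAlpha);
  }
}

Re-maximizing alpha on every EM iteration is what produces the log-likelihood gap quoted above (-6975124 vs. -7304552 after 20 iterations on Reuters).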
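The vectorizer in point 3 is conceptually the per-document half of the same variational loop, with the topic-term model held fixed. Another in-memory sketch under the same caveats (the real job is a map-reduce pass over the document vectors; names here are illustrative):

import java.util.Arrays;
import org.apache.commons.math3.special.Gamma;

// Sketch of the per-document inference an LDA vectorizer performs:
// given fixed topic-term probabilities beta[k][w] = p(word w | topic k)
// from the last LDA state, iterate the gamma/phi updates and emit gamma
// as the document's new feature vector.
public class LdaVectorizerSketch {

  // termCounts[w] = term frequency of word w in this document
  public static double[] inferGamma(int[] termCounts, double[][] beta, double alpha) {
    int numTopics = beta.length;
    int docLength = 0;
    for (int c : termCounts) {
      docLength += c;
    }

    double[] gamma = new double[numTopics];
    Arrays.fill(gamma, alpha + (double) docLength / numTopics); // standard init
    double[] digammaGamma = new double[numTopics];

    for (int iter = 0; iter < 50; iter++) { // fixed budget; a convergence test also works
      for (int k = 0; k < numTopics; k++) {
        digammaGamma[k] = Gamma.digamma(gamma[k]);
      }
      double[] newGamma = new double[numTopics];
      Arrays.fill(newGamma, alpha);
      for (int w = 0; w < termCounts.length; w++) {
        if (termCounts[w] == 0) {
          continue;
        }
        // phi_wk is proportional to beta_kw * exp(digamma(gamma_k))
        double[] phi = new double[numTopics];
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          phi[k] = beta[k][w] * Math.exp(digammaGamma[k]);
          norm += phi[k];
        }
        if (norm == 0.0) {
          continue; // word unseen by the model; skip it
        }
        for (int k = 0; k < numTopics; k++) {
          newGamma[k] += termCounts[w] * phi[k] / norm;
        }
      }
      gamma = newGamma;
    }
    return gamma; // this is the vector handed to canopy, kmeans, etc.
  }
}

The gammas are the variational Dirichlet parameters of the document's topic mixture, so documents with similar topic proportions land close together, which is what the downstream clusterers need.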
