Hi Mahouters, I was experimenting with the LDA clustering algorithm on the Reuters data set and made several enhancements which, if you find them interesting, I could contribute to the project:
1. Created a term-frequency vector pruner. LDA uses the tf vectors and not the tf-idf ones that result from seq2sparse, so words like "and", "where", etc. also get included in the resulting topics. To prevent that I run seq2sparse with the whole tf-idf sequence and then run the "pruner": it first calculates the standard deviation of the words' document frequencies and then prunes every entry in the tf vectors whose document frequency is greater than 3 times that standard deviation. This keeps most of the word population while still pruning the unnecessary terms (a sketch of the cutoff follows after this list).

2. Implemented the alpha-estimation part of the LDA algorithm as described in the Blei, Ng, Jordan paper. This gives better results in maximizing the log-likelihood for the same number of iterations. Just as an example: after 20 iterations on the Reuters data set the enhanced algorithm reaches a value of -6975124.693072233, compared to -7304552.275676554 with the original implementation (see the Newton-Raphson sketch below).

3. Created an LDA vectorizer. It executes only the inference part of the LDA algorithm, based on the last LDA state and the input document vectors, and for each document produces a vector of the gammas that result from the inference. The idea is that vectors produced this way can be used for clustering with any of the existing algorithms, like canopy, k-means, etc. (see the inference sketch below).
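First, the pruner. Here is a minimal in-memory sketch of the cutoff logic; the class and method names are mine for illustration, and the actual pruner runs over the seq2sparse output rather than plain maps, but the heuristic is the same:

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the pruning cutoff (hypothetical names): drop every term
 * whose document frequency exceeds 3x the standard deviation of all
 * document frequencies in the corpus.
 */
public class TfPruner {

  /** tfVector: termId -> tf weight; dfs: termId -> document frequency. */
  public static Map<Integer, Double> prune(Map<Integer, Double> tfVector,
                                           Map<Integer, Integer> dfs) {
    // Mean and standard deviation of the corpus document frequencies.
    double mean = 0;
    for (int df : dfs.values()) {
      mean += df;
    }
    mean /= dfs.size();
    double var = 0;
    for (int df : dfs.values()) {
      var += (df - mean) * (df - mean);
    }
    double threshold = 3 * Math.sqrt(var / dfs.size());

    // Keep only the entries below the cutoff; high-df terms like
    // "and" or "where" fall out here.
    Map<Integer, Double> pruned = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> e : tfVector.entrySet()) {
      if (dfs.get(e.getKey()) <= threshold) {
        pruned.put(e.getKey(), e.getValue());
      }
    }
    return pruned;
  }
}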
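Second, the alpha estimation. In sketch form it is a Newton-Raphson iteration on log(alpha), shown here for a symmetric Dirichlet to keep it short (appendix A.4.2 of the paper also gives the vector-alpha version with a linear-time Newton step). suffStats is the sum over documents d and topics k of digamma(gamma_dk) - digamma(sum_j gamma_dj), collected during the E-step; the constants and names are illustrative, not the exact patch:

/**
 * Sketch of Newton-Raphson estimation of a symmetric alpha, iterating
 * in log space so alpha stays positive.
 */
public class AlphaEstimator {

  private static final int MAX_ITER = 100;
  private static final double NEWTON_THRESH = 1e-5;

  public static double estimateAlpha(double initAlpha, double suffStats,
                                     int numDocs, int numTopics) {
    double logAlpha = Math.log(initAlpha);
    for (int iter = 0; iter < MAX_ITER; iter++) {
      double a = Math.exp(logAlpha);
      if (Double.isNaN(a)) {
        // The step overshot; restart from a larger initial value.
        initAlpha *= 10;
        a = initAlpha;
        logAlpha = Math.log(a);
      }
      // First and second derivatives of the alpha-dependent part of the
      // variational bound, for a symmetric alpha over numTopics topics.
      double df = numDocs * numTopics * (digamma(numTopics * a) - digamma(a))
          + suffStats;
      double d2f = numDocs * numTopics
          * (numTopics * trigamma(numTopics * a) - trigamma(a));
      logAlpha -= df / (d2f * a + df);  // Newton step on log(alpha)
      if (Math.abs(df) < NEWTON_THRESH) {
        break;
      }
    }
    return Math.exp(logAlpha);
  }

  /** Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus asymptotics. */
  static double digamma(double x) {
    double result = 0;
    for (; x < 6; x++) {
      result -= 1 / x;
    }
    double r = 1 / (x * x);
    return result + Math.log(x) - 0.5 / x
        - r * (1.0 / 12 - r * (1.0 / 120 - r / 252));
  }

  /** Trigamma via the recurrence psi'(x) = psi'(x+1) + 1/x^2 plus asymptotics. */
  static double trigamma(double x) {
    double result = 0;
    for (; x < 6; x++) {
      result += 1 / (x * x);
    }
    double r = 1 / (x * x);
    return result + 1 / x + 0.5 * r
        + r / x * (1.0 / 6 - r * (1.0 / 30 - r / 42));
  }
}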
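Finally, the vectorizer. It is essentially the variational inference loop from the paper run against a fixed model, returning the gamma vector as the document's features. logBeta and alpha would be read from the last LDA state; the names and the dense phi matrix are for illustration only (the real code streams over sparse vectors):

/**
 * Sketch of the LDA vectorizer: run only variational inference against
 * a trained model and emit gamma as the document's feature vector.
 */
public class LdaVectorizer {

  private final double[][] logBeta;  // [topic][term]: log p(word | topic)
  private final double alpha;        // symmetric Dirichlet prior
  private final int numTopics;

  public LdaVectorizer(double[][] logBeta, double alpha) {
    this.logBeta = logBeta;
    this.alpha = alpha;
    this.numTopics = logBeta.length;
  }

  /** termIds/counts: the document's sparse tf vector; returns gamma. */
  public double[] vectorize(int[] termIds, double[] counts, int maxIter) {
    int n = termIds.length;
    double total = 0;
    for (double c : counts) {
      total += c;
    }
    // Initialization from the paper: gamma_k = alpha + N/K.
    double[] gamma = new double[numTopics];
    java.util.Arrays.fill(gamma, alpha + total / numTopics);
    double[][] phi = new double[n][numTopics];

    for (int iter = 0; iter < maxIter; iter++) {
      for (int i = 0; i < n; i++) {
        // phi_ik proportional to exp(log beta_k,w_i + digamma(gamma_k));
        // a production version would normalize in log space instead.
        double norm = 0;
        for (int k = 0; k < numTopics; k++) {
          phi[i][k] = Math.exp(logBeta[k][termIds[i]]
              + AlphaEstimator.digamma(gamma[k]));  // digamma from the sketch above
          norm += phi[i][k];
        }
        for (int k = 0; k < numTopics; k++) {
          phi[i][k] /= norm;
        }
      }
      // gamma_k = alpha + sum_i count_i * phi_ik
      for (int k = 0; k < numTopics; k++) {
        gamma[k] = alpha;
        for (int i = 0; i < n; i++) {
          gamma[k] += counts[i] * phi[i][k];
        }
      }
    }
    return gamma;
  }
}

Usage is just new LdaVectorizer(logBeta, alpha).vectorize(termIds, counts, 50); the resulting gammas go straight into canopy or k-means as ordinary feature vectors.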
Regards,
Vasil