Hi Mahouters,

I was experimenting with the LDA clustering algorithm on the Reuters data
set and made several enhancements, which I could contribute to the project
if you find them interesting:

1. Created a term-frequency vector pruner: LDA uses the tf vectors produced
by seq2sparse, not the tf-idf ones, so words like "and", "where", etc. also
end up in the resulting topics. To prevent that, I run the full seq2sparse
tf-idf sequence and then run the "pruner". It first calculates the standard
deviation of the document frequencies of the words, then prunes every entry
in the tf vectors whose document frequency is greater than 3 times that
standard deviation. This keeps most of the word population while still
pruning the unnecessary terms. (A sketch of the pruning rule follows below.)
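To make the rule concrete, here is a minimal in-memory sketch, not the
actual MapReduce job; the class name TfPrunerSketch and the term-index ->
document-frequency map are hypothetical stand-ins for reading the df-count
output of seq2sparse:

import java.util.HashMap;
import java.util.Map;

public class TfPrunerSketch {

  // Returns the surviving terms: those whose document frequency does not
  // exceed 3 times the standard deviation of all document frequencies.
  public static Map<Integer, Long> prune(Map<Integer, Long> docFreqs) {
    // Mean of the document frequencies.
    double mean = 0.0;
    for (long df : docFreqs.values()) {
      mean += df;
    }
    mean /= docFreqs.size();

    // Standard deviation of the document frequencies.
    double variance = 0.0;
    for (long df : docFreqs.values()) {
      double d = df - mean;
      variance += d * d;
    }
    double stdDev = Math.sqrt(variance / docFreqs.size());

    // Keep only terms with df <= 3 * stdDev; very frequent words
    // ("and", "where", ...) fall above this cutoff and are dropped.
    double threshold = 3.0 * stdDev;
    Map<Integer, Long> kept = new HashMap<>();
    for (Map.Entry<Integer, Long> e : docFreqs.entrySet()) {
      if (e.getValue() <= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}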

2. Implemented the alpha-estimation part of the LDA algorithm as described
in the Blei, Ng, Jordan paper. This yields a better log-likelihood for the
same number of iterations. As an example, after 20 iterations on the Reuters
data set the enhanced algorithm reaches a value of -6975124.693072233,
compared to -7304552.275676554 with the original implementation. (A sketch
of the Newton-Raphson update is below.)
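For reference, a minimal sketch of the Newton-Raphson update for a
symmetric alpha, in the style of the opt_alpha routine in Blei's lda-c. It
assumes Apache Commons Math 3 on the classpath; numDocs, numTopics and the
sufficient statistic alphaSuffStats (the sum over documents and topics of
digamma(gamma_dk) - digamma(sum_k gamma_dk)) are hypothetical inputs that
the real job would accumulate during inference:

import org.apache.commons.math3.special.Gamma;

public class AlphaEstimatorSketch {

  public static double estimateAlpha(double alphaSuffStats,
                                     int numDocs, int numTopics) {
    // Optimize log(alpha) to keep alpha positive; initial guess is
    // arbitrary but positive.
    double logAlpha = Math.log(100.0 / numTopics);
    for (int iter = 0; iter < 100; iter++) {
      double alpha = Math.exp(logAlpha);
      // First derivative of the likelihood bound w.r.t. alpha.
      double df = numDocs * numTopics
          * (Gamma.digamma(numTopics * alpha) - Gamma.digamma(alpha))
          + alphaSuffStats;
      // Second derivative of the likelihood bound w.r.t. alpha.
      double d2f = numDocs * numTopics
          * (numTopics * Gamma.trigamma(numTopics * alpha)
              - Gamma.trigamma(alpha));
      // Newton step in log space: log_a -= df / (d2f * a + df).
      logAlpha -= df / (d2f * alpha + df);
      if (Math.abs(df) < 1e-5) {
        break;
      }
    }
    return Math.exp(logAlpha);
  }
}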

3. Created an LDA Vectorizer. It executes only the inference part of the
LDA algorithm against the last LDA state and the input document vectors,
and for each document produces a vector of the gammas that result from the
inference. The idea is that vectors produced this way can be clustered with
any of the existing algorithms (canopy, k-means, etc.); a sketch follows.
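Below is a minimal sketch of the vectorizer idea: run only the variational
E-step from the Blei, Ng, Jordan paper against a fixed model and emit the
per-document gamma vector. The inputs are hypothetical, logBeta[k][w] (log
topic-word probabilities from the last LDA state), alpha, and one
bag-of-words document; the real job would read Mahout's LDA state files:

import org.apache.commons.math3.special.Gamma;

public class LdaVectorizerSketch {

  // termIds[n] / counts[n]: the distinct terms of one document and their
  // term frequencies. Returns gamma, a dense K-dimensional vector that
  // can be fed to canopy, k-means, etc.
  public static double[] inferGamma(double[][] logBeta, double alpha,
                                    int[] termIds, double[] counts) {
    int numTopics = logBeta.length;
    double total = 0.0;
    for (double c : counts) {
      total += c;
    }

    // Initialization as in the paper: gamma_k = alpha + N / K.
    double[] gamma = new double[numTopics];
    java.util.Arrays.fill(gamma, alpha + total / numTopics);

    for (int iter = 0; iter < 50; iter++) {
      double[] newGamma = new double[numTopics];
      java.util.Arrays.fill(newGamma, alpha);
      for (int n = 0; n < termIds.length; n++) {
        // phi_nk proportional to beta_{k,w_n} * exp(digamma(gamma_k));
        // computed in log space, then exponentiated and normalized.
        double[] phi = new double[numTopics];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < numTopics; k++) {
          phi[k] = logBeta[k][termIds[n]] + Gamma.digamma(gamma[k]);
          max = Math.max(max, phi[k]);
        }
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          phi[k] = Math.exp(phi[k] - max);
          norm += phi[k];
        }
        // gamma_k = alpha + sum_n count_n * phi_nk.
        for (int k = 0; k < numTopics; k++) {
          newGamma[k] += counts[n] * phi[k] / norm;
        }
      }
      gamma = newGamma;
    }
    return gamma;
  }
}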

Regards, Vasil
