Re: LDA related enhancements

Vasil Vasilev Thu, 28 Apr 2011 00:54:33 -0700

Hi all,

The LDA Vectorization patch is ready. You can take a look at:
https://issues.apache.org/jira/browse/MAHOUT-683

Regards, Vasil

On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <[email protected]> wrote:

> Ok. I am going to try out 1) as suggested by Jake, then write a couple of
> tests, and then I will file the JIRA issues.
>
>
> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>
>> > Hi Mahouters,
>> >
>> > I was experimenting with the LDA clustering algorithm on the Reuters data
>> > set and made several enhancements which, if you find them interesting, I
>> > could contribute to the project:
>> >
>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors and not
>> > the tf-idf ones that result from seq2sparse. Because of this, words like
>> > "and", "where", etc. also get included in the resulting topics. To prevent
>> > that, I run seq2sparse with the whole tf-idf sequence and then run the
>> > "pruner". It first calculates the standard deviation of the document
>> > frequencies of the words and then prunes all entries in the tf vectors whose
>> > document frequency is greater than 3 times the calculated standard
>> > deviation. This keeps most of the word population while still pruning
>> > the unnecessary words.
>> >
>> > 2. Implemented the alpha-estimation part of the LDA algorithm as described
>> > in the Blei, Ng, Jordan paper. This leads to better results in maximizing
>> > the log-likelihood for the same number of iterations. Just an example: for
>> > 20 iterations on the Reuters data set the enhanced algorithm reaches a value
>> > of -6975124.693072233, compared to -7304552.275676554 with the original
>> > implementation.
>> >
>> > 3. Created an LDA Vectorizer. It executes only the inference part of the LDA
>> > algorithm, based on the last LDA state and the input document vectors, and
>> > for each vector produces a vector of the gammas that result from the
>> > inference. The idea is that the vectors produced in this way can be used for
>> > clustering with any of the existing algorithms (like canopy, kmeans, etc.).
>> >
>>
>> As Jake says, this all sounds great.  Please see:
>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>
>>
>
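
The pruning rule from item 1 of the quoted message can be sketched roughly as
follows. This is only an illustration of the rule as worded in the thread, not
the code from the patch: the function names, the dict-based tf vectors, and the
choice of the population standard deviation are all assumptions made here. It
also takes the threshold literally as 3 times the standard deviation of the
document frequencies, with no mean added.

```python
from statistics import pstdev


def document_frequencies(tf_vectors):
    """Count, for every term, how many documents contain it."""
    df = {}
    for vec in tf_vectors:
        for term, count in vec.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1
    return df


def prune_tf_vectors(tf_vectors, max_sigma=3.0):
    """Drop entries whose document frequency exceeds max_sigma * std(df).

    tf_vectors is a list of {term: term_frequency} dicts (a hypothetical
    stand-in for sparse tf vectors). pstdev is an assumption; the thread
    does not say whether a population or sample deviation is used.
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]
    threshold = max_sigma * pstdev(list(df.values()))
    return [
        {term: count for term, count in vec.items()
         if df.get(term, 0) <= threshold}
        for vec in tf_vectors
    ]


# Ten toy documents: "and" occurs in all of them, every other term in one.
docs = [{"and": 3, f"w{2 * i}": 1, f"w{2 * i + 1}": 1} for i in range(10)]
pruned = prune_tf_vectors(docs)
```

In this toy corpus "and" has a document frequency of 10 while the threshold is
a little under 6, so only "and" is removed. Under this literal reading, a
corpus in which every term has the same document frequency has a standard
deviation of zero, and everything would be pruned.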
