On Mon, Jun 7, 2010 at 5:06 AM, Avishay Livne1 <[email protected]> wrote:
> I modified
> $MAHOUT_HOME/utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
>  so the score is printed alongside each word, but the interpretation of the
> scores is somewhat obscure.
> I see values in the range of -8 to +6. I assumed the values should
> represent P(word | topic) or log(P(word | topic)), but these values are in
> a different range.
> How should I interpret these values? Is there a simple way to retrieve P
> (word | topic)?

Sorry about that. The scores are log p(word|topic) + constant, because
they're normalized online during the E-step, and so the serialized
values don't need to be normalized. You can normalize them by
computing the log of the sum of exp(score) over all of the words in a
topic and subtracting that from each score; exponentiating the result
then gives p(word|topic).
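
To make the normalization concrete, here is a rough sketch in Python. It
is not Mahout code: the function name and the example scores are made up
for illustration, and it assumes you have already dumped the scores for
every word in one topic (not just the top few that get printed) into a
list.

```python
import math

def normalize_log_scores(scores):
    """Convert unnormalized log scores (log p + constant) to probabilities.

    `scores` must hold the score of every word in a single topic.
    """
    # Subtract the maximum before exponentiating so exp() cannot overflow.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    # score - log_sum is log p(word|topic); exp() of that is p(word|topic).
    return [math.exp(s - log_sum) for s in scores]

# Hypothetical scores for one topic, in the -8 to +6 range reported above.
scores = [6.0, 2.5, -1.0, -8.0]
probs = normalize_log_scores(scores)
```

Because the unknown constant is the same for every word in the topic, it
cancels in the subtraction, so the result sums to 1 regardless of its
value.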

>
> Thanks,
> Avishay.
>
>
>
>  From:       Avishay Livne1/Haifa/i...@ibmil
>
>  To:         [email protected]
>
>  Date:       06/06/2010 03:16 PM
>
>  Subject:    extract p(doc|topic) from LDA
>
>
>
>
>
>
>
> Hi,
>
> I'm trying to use LDA for a collaborative filtering task, where I need to
> predict the rating a user (document) will give to a movie (word).
> I ran LDA and constructed T topics, but I can only print the most frequent
> words (movies) per topic.
> Is it possible to extract p(document|topic) or p(word|topic) from LDA's
> output? (document = new user, word = movie).
>
> Best regards,
> Avishay
>
>
>
>
>
