Re: LDA topic modeling and Spark

2015-12-03 Thread Robin East
What exactly is this probability distribution? For each word in your vocabulary 
it is the probability that a randomly drawn word from the topic is that word. 
Another way to visualise it is as a two-column table where the first column is a 
word in your vocabulary and the second column is the probability of that word 
appearing. All the values in the second column must be >= 0, and they must sum 
to 1. That is the definition of a probability distribution.

Clearly, for the idea of topics to be at all useful, you want different topics 
to exhibit different probability distributions, i.e. some words should be more 
likely in one topic than in another.
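As a concrete sketch of the idea (the vocabulary, topic labels and probabilities below are made up for illustration, not taken from any real model):

```python
# Two topics over a tiny illustrative vocabulary (made-up numbers).
# Each topic maps every vocabulary word to a probability; the values are
# non-negative and sum to 1, so each topic is a probability distribution
# over words. Note the two topics weight the words differently.
topics = {
    "topic_0": {"spark": 0.7, "cluster": 0.2, "topic": 0.1},
    "topic_1": {"spark": 0.1, "cluster": 0.3, "topic": 0.6},
}

for name, dist in topics.items():
    # Check the two defining properties of a probability distribution.
    assert all(p >= 0 for p in dist.values())
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    print(name, dist)
```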

How does it actually infer words and topics? Probably a good idea to google for 
that one if you really want to understand the details - there are some great 
resources available.

How can I connect the output to the actual words in each topic? A typical 
approach is to look at the top 5, 10 or 20 words in each topic and use those to 
infer what the topic represents.
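In Spark MLlib the fitted LDAModel exposes this via describeTopics(maxTermsPerTopic), which returns, for each topic, the indices of its highest-weight terms together with their weights; you then map those indices back through your vocabulary. Here is a plain-Python sketch of that index-to-word mapping (the vocabulary and the topic-term weights are made up for illustration, playing the role of columns of the model's topics matrix):

```python
# Map the top-N term indices of each topic back to vocabulary words.
# topic_term_weights[t][i] is the (illustrative) weight of vocabulary
# word i in topic t.
vocabulary = ["spark", "cluster", "topic", "model", "data"]

topic_term_weights = [
    [0.40, 0.25, 0.05, 0.10, 0.20],  # topic 0
    [0.05, 0.10, 0.45, 0.30, 0.10],  # topic 1
]

def top_words(weights, vocab, n=3):
    # Rank term indices by descending weight and keep the top n,
    # returning (word, weight) pairs.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(vocab[i], weights[i]) for i in ranked[:n]]

for t, weights in enumerate(topic_term_weights):
    print("topic", t, top_words(weights, vocabulary))
```

Reading off the top few words per topic like this is usually enough to give each topic a human-interpretable label.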
---
Robin East
Spark GraphX in Action, by Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 

> On 3 Dec 2015, at 05:07, Nguyen, Tiffany T  wrote:
> 
> Hello,
> 
> I have been trying to understand the LDA topic modeling example provided here: 
> https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda.
> In the example, they load word count vectors from a text file that contains 
> these word counts and then they output the topics, which are represented as 
> probability distributions over words. What exactly is this probability 
> distribution? How does it actually infer words and topics and how can I 
> connect the output to the actual words in each topic?
> 
> Thanks!


