What exactly is this probability distribution? For each word in your vocabulary 
it is the probability that a word drawn at random from the topic is that word. 
Another way to visualise it is as a 2-column table where the 1st column is a 
word in your vocabulary and the 2nd column is the probability of that word 
appearing. All the values in the 2nd column must be >= 0, and if you add up 
all the values they must sum to 1. That is the definition of a probability 
distribution. 
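As a tiny sketch (the vocabulary and the numbers here are entirely made up, 
just to show the shape of the thing):

```python
# One topic, represented as a probability distribution over a tiny
# made-up vocabulary: word -> probability of drawing that word.
topic = {
    "goal": 0.4,
    "match": 0.3,
    "team": 0.2,
    "election": 0.1,
}

# Every probability must be non-negative...
assert all(p >= 0 for p in topic.values())
# ...and together they must sum to 1.
assert abs(sum(topic.values()) - 1.0) < 1e-9
```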

Clearly, for the idea of topics to be at all useful you want different topics 
to exhibit different probability distributions, i.e. some words should be more 
likely in one topic than in another.
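For instance (again with invented words and numbers, purely for illustration), 
two useful topics might put very different weight on the same word:

```python
# Two hypothetical topics over the same made-up vocabulary.
sports   = {"goal": 0.50, "match": 0.30, "vote": 0.10, "party": 0.10}
politics = {"goal": 0.05, "match": 0.05, "vote": 0.50, "party": 0.40}

# "goal" is far more likely under the sports topic than the politics one,
# which is what makes the two topics distinguishable.
assert sports["goal"] > politics["goal"]
assert politics["vote"] > sports["vote"]
```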

How does it actually infer words and topics? It's probably a good idea to 
google that one if you really want to understand the details - there are some 
great resources available.

How can I connect the output to the actual words in each topic? A typical way 
is to look at the top 5, 10 or 20 words in each topic and use those to infer 
what the topic represents.
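Mechanically, that just means pairing each topic's probabilities with your 
vocabulary and sorting. A minimal sketch in plain Python (the vocabulary and 
probabilities are made up; with Spark you would pull the per-topic 
probabilities out of the fitted model instead):

```python
# One topic's word probabilities, in the same order as the vocabulary.
vocab = ["goal", "match", "team", "vote", "party", "league"]
topic_probs = [0.30, 0.25, 0.20, 0.05, 0.05, 0.15]

# Pair each word with its probability and take the highest-weighted words.
top = sorted(zip(vocab, topic_probs), key=lambda wp: wp[1], reverse=True)[:3]

for word, prob in top:
    print(word, prob)
```

Looking at those top words ("goal", "match", "team" here) is usually enough 
to label the topic informally, e.g. "sports".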
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 

> On 3 Dec 2015, at 05:07, Nguyen, Tiffany T <nguye...@grinnell.edu> wrote:
> 
> Hello,
> 
> I have been trying to understand the LDA topic modeling example provided 
> here: 
> https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
>  In the example, they load word count vectors from a text file that contains 
> these word counts and then they output the topics, which are represented as 
> probability distributions over words. What exactly is this probability 
> distribution? How does it actually infer words and topics and how can I 
> connect the output to the actual words in each topic?
> 
> Thanks!
