On Thu, Jun 30, 2011 at 12:02 PM, wine lover <[email protected]> wrote:
> Thanks, Hector, you are right, the exact meaning of topic_i is not
> necessary for unsupervised clustering.
>
> However, in order to cluster a set of documents, I still need to know the
> probabilistic relationship between each topic and each document. I am not
> very clear on how to get this kind of information from the generated result.
>
> For instance:
>
>   model [p(model|topic_0) = 0.010358664102351409
>
> Here, "model" is a word, but the result does not tell me anything about the
> relationship between this word and a given document. Thanks.

The current release of Mahout does produce the p(topic | document)
probabilities. They are emitted after the final iteration, into a sequence
file in the same directory as the model outputs. I think it's called
"docTopics" or something like that?

  -jake

> On Thu, Jun 30, 2011 at 2:08 PM, wine lover <[email protected]> wrote:
>
> > Hello Everyone,
> >
> > I have two questions on the LDA analysis.
> >
> > After running the lda command, the generated directory "testdata-lda"
> > contains several folders: docTopics, state-0, state-1, ....
> >
> > It seems to me that the "state-x" folders are converted into a readable
> > format by running "ldatopics". But what does the "docTopics" folder
> > stand for? How can I view it?
> >
> > Running the ldatopics command generates 20 files in total (topic_0,
> > topic_1, etc.). For instance, in the file topic_0, I get information
> > such as the following:
> >
> >   model [p(model|topic_0) = 0.010358664102351409
> >   tissues [p(tissues|topic_0) = 0.008870984984037485
> >
> > How can I tell what topic_0 stands for? Where can I find this kind of
> > information? Moreover, is there any other procedure to generate the
> > clustering result based on these topic_x files?
> >
> > Thank you very much for the help.
> >
> > Wenyia
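
For turning the p(topic | document) probabilities mentioned in the reply above into an actual clustering, one simple option is to assign each document to its most probable topic. The following is only a minimal sketch in plain Python, not Mahout code: it assumes the per-document topic vectors have already been exported from the sequence file by some other means, and the document ids, the probability values, and the helper names assign_clusters and group_by_topic are all made up for illustration.

```python
from collections import defaultdict


def assign_clusters(doc_topics):
    """Map each document id to the index of its most probable topic."""
    return {
        doc_id: max(range(len(probs)), key=probs.__getitem__)
        for doc_id, probs in doc_topics.items()
    }


def group_by_topic(assignments):
    """Invert the doc -> topic mapping into topic -> list of doc ids."""
    clusters = defaultdict(list)
    for doc_id, topic in assignments.items():
        clusters[topic].append(doc_id)
    return dict(clusters)


# Hypothetical p(topic | document) vectors, one entry per topic.
# These numbers are invented; real values would come from docTopics.
doc_topics = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.05, 0.15, 0.80],
    "doc_c": [0.60, 0.30, 0.10],
}

assignments = assign_clusters(doc_topics)
clusters = group_by_topic(assignments)
print(clusters)  # {0: ['doc_a', 'doc_c'], 2: ['doc_b']}
```

Taking the arg max gives a hard clustering with one cluster per topic. If soft membership is preferred, the probability vectors themselves can be kept as features and passed to any other clustering method.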
