Ivan,

Mahout LDA input:

1) A set of (document id, term vector) pairs in SequenceFile<IntWritable, VectorWritable> format.
2) Optionally, a dictionary of (term, term index) pairs in SequenceFile<Text, IntWritable> format.

Output:

1) A "model": a set of (topic index, term vector) pairs in SequenceFile<IntWritable, VectorWritable> format. Topic identifiers are zero-based indices.
2) Optionally, a set of (document id, topic vector) pairs in SequenceFile<IntWritable, VectorWritable> format. This is the inference output of the trained model on input #1 above, so the keys are the document ids from that input.

Note that the topic vectors are dense and have cardinality equal to the number of latent topics you trained with (e.g. 50 or 100). Entry k in document d's topic vector represents the model's estimate of p(topic = k | doc = d).

Andy
@sagemintblue

On Mon, Jul 2, 2012 at 5:54 AM, ivan obeso <[email protected]> wrote:

> Hi,
>
> I would like to know which is the order of the documents in the LDA running
> results. For example, I know that the topic/word file is a group of
> IntWritable keys with VectorWritable values, and the key corresponds with
> the topic id and the intWritable have in position 0 the word in position 0
> in the dictionary file....
>
> but in the document/topic file I am not sure about the order followed. The
> key is an IntWritable that represents the document ID, but I don't know
> where to read the filename/docID table.
>
> Thanks.
>
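P.S. To make the description of the document/topic output in the reply concrete, here is a minimal sketch of how one might interpret a topic vector once it has been read out of the SequenceFile. It is plain Python and does not use Mahout or Hadoop; the `top_topics` helper, the `doc_topics` dictionary and all the numbers in it are hypothetical, made up purely for illustration.

```python
def top_topics(topic_vector, n=3):
    """Return the n most probable (topic index, probability) pairs.

    topic_vector is a dense list whose length equals the number of
    latent topics; entry k is read as p(topic = k | doc = d).
    """
    ranked = sorted(enumerate(topic_vector), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical inference output: document id -> dense topic vector.
# Four topics are assumed here only to keep the example short.
doc_topics = {
    0: [0.70, 0.10, 0.15, 0.05],
    1: [0.05, 0.05, 0.10, 0.80],
}

for doc_id, vector in sorted(doc_topics.items()):
    # Topic indices are zero-based, matching the keys of the model output.
    print(doc_id, top_topics(vector, n=2))
```

The topic index returned for a document is the same zero-based index used as the key in the model output, so it can be used to look up that topic's term vector.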
