Hi,

My question is about interpreting the LDA document-topics output.
I am using trunk. I have a directory of documents, each of which is named by an integer, and the data directory has no sub-directories. The directory structure is as follows:

  $ ls /path/to/data/
  1 2 5 ...

From those documents I want to detect topics, and output:

  - topic - top terms
  - document - top topics

To this end, I first run seqdirectory on the directory:

  $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1

Then I run seq2sparse to create tf vectors of the documents:

  $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector

After creating the vectors, I run cvb0_local on those tf-vectors:

  $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0

And to interpret the results, I use mahout's vectordump utility:

  $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
  $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true

The resulting words file consists of #ofTopics lines. I assume each line is in <topicID \t wordsVector> format, where a wordsVector is a sorted vector whose elements are <word, score> pairs.

The resulting docs file, on the other hand, consists of #ofDocuments lines. I assume each line is in <documentID \t topicsVector> format, where a topicsVector is a sorted vector whose elements are <topicID, probability> pairs.

The problem is that the documentID field does not match the original document ids. This field is populated with zero-based auto-incrementing indices. I would like to know whether I am missing an option that makes vectordump output the correct document ids, whether this is the normal behavior when one runs lda on a directory of documents, or whether I am making a mistake in one of those steps.
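To make the mismatch concrete, here is a minimal sketch (plain Python, not part of Mahout) of the post-processing step I would otherwise need. It assumes the docs file really is in the <documentID \t topicsVector> format I guessed above, and it assumes I had a mapping from the zero-based index back to the original file name, which is exactly what I do not know how to obtain. The function name, the mapping and the sample lines are all hypothetical.

```python
def relabel_doc_lines(lines, index_to_name):
    """Replace the leading zero-based index of each dumped line with a document name.

    Assumes each line looks like "<index>\t<topicsVector>"; the vector part is
    passed through untouched, so its exact syntax does not matter here.
    """
    relabeled = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        row_id, sep, vector = line.partition("\t")
        key = row_id.strip()
        if key.isdigit():
            # Fall back to the unchanged index when no name is known for it.
            name = index_to_name.get(int(key), row_id)
        else:
            name = row_id  # not a numeric index, leave the field as it is
        relabeled.append(f"{name}{sep}{vector}")
    return relabeled


if __name__ == "__main__":
    # Hypothetical mapping: zero-based index -> file name under /path/to/data/
    index_to_name = {0: "1", 1: "2", 2: "5"}
    # Hypothetical dumped lines; the numbers are made up for illustration.
    dumped = ["0\t{3:0.61,7:0.22}", "1\t{1:0.48,3:0.30}", "2\t{7:0.75,0:0.10}"]
    for out in relabel_doc_lines(dumped, index_to_name):
        print(out)
```

This only works if the index order can be tied back to the input files reliably, which is why I would prefer the ids to be carried through the pipeline instead.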
I suspect the issue is that seqdirectory assigns Text ids to documents, while the CVB algorithm expects documents in another format, <IntWritable, VectorWritable>. If this is the case, could you help me assign IntWritable ids to documents in the process of creating vectors from them? Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to do so?

Thanks

--
Gokhan
