Jake, I converted the ids to integers with rowid, and then modified InMemoryCollapsedVariationBayes0.loadVectors() so that it returns a SparseMatrix (instead of a SparseRowMatrix) whose row ids are the keys of the <IntWritable, VectorWritable> tf vectors. I have not fully verified that it works, since the mapped integer ids (the results of rowid) are themselves in the range [0, #ofDocuments), but I believe it does.
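To make the change concrete, here is a minimal plain-Java sketch of the difference between the old and the new row assignment. It does not use any Mahout classes: KeyedRowsSketch, Entry, loadSequential and loadKeyed are hypothetical names chosen for illustration, and the TreeMap merely stands in for the in-memory matrix. It is not the actual diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names exist in Mahout.
final class KeyedRowsSketch {

    // One (key, vector) record, standing in for an <IntWritable, VectorWritable> pair.
    static final class Entry {
        final int key;
        final double[] vector;

        Entry(int key, double[] vector) {
            this.key = key;
            this.vector = vector;
        }
    }

    // Behavior described as the bug: keys are ignored and rows are numbered in read order.
    static Map<Integer, double[]> loadSequential(List<Entry> records) {
        Map<Integer, double[]> rows = new TreeMap<>();
        int next = 0;
        for (Entry e : records) {
            rows.put(next++, e.vector);
        }
        return rows;
    }

    // Intended behavior: each row is stored under the key it arrived with.
    static Map<Integer, double[]> loadKeyed(List<Entry> records) {
        Map<Integer, double[]> rows = new TreeMap<>();
        for (Entry e : records) {
            rows.put(e.key, e.vector);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Records deliberately arrive out of key order.
        List<Entry> records = new ArrayList<>();
        records.add(new Entry(2, new double[] {0.0, 3.0}));
        records.add(new Entry(0, new double[] {1.0, 0.0}));
        records.add(new Entry(1, new double[] {0.0, 2.0}));

        // Row 0 here holds the vector that arrived with key 2.
        System.out.println(loadSequential(records).get(0)[1]);
        // Row 0 here holds the vector that arrived with key 0.
        System.out.println(loadKeyed(records).get(0)[0]);
    }
}
```

With the records arriving in the order 2, 0, 1, loadSequential stores the vector for key 2 in row 0, whereas loadKeyed keeps every vector under its own key.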
Constructing SparseMatrix needs RandomAccessSparseVector as row vectors and tf-vectors are sparse vectors, so I assumed that an incoming tf vector itself, or getDelegate if it is a NamedVector, can be cast to RandomAccessSparseVector.

I will submit the diff tomorrow, so you can review and commit.

Thank you for your help.

On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[email protected]> wrote:

> Hi Gokhan,
>
> This looks like a bug in the InMemoryCollapsedVariationBayes0.loadVectors()
> method - it takes the SequenceFile<? extends Writable, VectorWritable> and
> ignores the keys, assigning the rows in order into an in-memory Matrix.
>
> If you run "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>"
> this converts Text keys into IntWritable keys (and leaves behind an index
> file, a mapping of Text -> IntWritable which tells you which int is assigned
> to which original text key).
>
> Then what you'd want to do is modify
> InMemoryCollapsedVariationBayes0.loadVectors() to actually use the keys which
> are given to it, instead of reassigning to sequential ids. If you make this
> change, we'd love to have the diff - not too many people use the cvb0_local
> path (it's usually used for debugging and testing smaller data sets to see
> that topics are converging properly), but getting it to actually produce
> document -> topic outputs which correlate with original docIds would be
> very nice! :)
>
> On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[email protected]> wrote:
>
> > Hi,
> >
> > My question is about interpreting lda document-topics output.
> >
> > I am using trunk.
> >
> > I have a directory of documents, each of which are named by integers, and
> > there is no sub-directory of the data directory.
> > The directory structure is as follows
> > $ ls /path/to/data/
> > 1
> > 2
> > 5
> > ...
> >
> > From those documents I want to detect topics, and output:
> > - topic - top terms
> > - document - top topics
> >
> > To this end, I first run seqdirectory on the directory:
> > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> >
> > Then I run seq2sparse to create tf vectors of documents:
> > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector
> >
> > After creating vectors, I run cvb0_local on those tf-vectors:
> > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> >
> > And to interpret the results, I use mahout's vectordump utility:
> > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
> >
> > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true
> >
> > The resulting words file consists of #ofTopics lines.
> > I assume each line is in <topicID \t wordsVector> format, where a
> > wordsVector is a sorted vector whose elements are <word, score> pairs.
> >
> > The resulting docs file on the other hand, consists of #ofDocuments lines.
> > I assume each line is in <documentID \t topicsVector> format, where a
> > topicsVector is a sorted vector whose elements are <topicID, probability>
> > pairs.
> >
> > The problem is that the documentID field does not match with the original
> > document ids. This field is populated with zero-based auto-incrementing
> > indices.
> >
> > I want to ask if I am missing something for vectordump to output correct
> > document ids, or this is the normal behavior when one runs lda on a
> > directory of documents, or I make a mistake in one of those steps.
> >
> > I suspect the issue is seqdirectory assigns Text ids to documents, while
> > CVB algorithm expects documents in another format, <IntWritable,
> > VectorWritable>. If this is the case, could you help me for assigning
> > IntWritable ids to documents in the process of creating vectors from them?
> > Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to do so?
> >
> > Thanks
> >
> > --
> > Gokhan
>
> --
>
>   -jake

--
Gokhan
