Hi Jake,

Today I submitted the diff. It is available at
https://issues.apache.org/jira/browse/MAHOUT-1051
Thanks for the advice.

On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix <[email protected]> wrote:

> Sounds great Gokhan!
>
> On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan <[email protected]> wrote:
>
> > Jake,
> >
> > I converted the ids to integers with rowid, and then modified
> > InMemoryCollapsedVariationalBayes0.loadVectors() so that it returns a
> > SparseMatrix (instead of a SparseRowMatrix) whose row ids are the keys
> > from the <IntWritable, VectorWritable> tf vectors. I am not sure it
> > works, since the mapped integer ids (the results of rowid) are in the
> > range [0, #ofDocuments), but I believe it does.
> >
> > Constructing a SparseMatrix needs RandomAccessSparseVector row vectors,
> > and tf vectors are sparse vectors, so I assumed that an incoming tf
> > vector itself, or its getDelegate() if it is a NamedVector, can be
> > cast to RandomAccessSparseVector. I will submit the diff tomorrow, so
> > you can review and commit.
> >
> > Thank you for your help.
> >
> > On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[email protected]> wrote:
> >
> > > Hi Gokhan,
> > >
> > > This looks like a bug in the
> > > InMemoryCollapsedVariationalBayes0.loadVectors() method - it takes
> > > the SequenceFile<? extends Writable, VectorWritable> and ignores the
> > > keys, assigning the rows in order into an in-memory Matrix.
> > >
> > > If you run "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path>
> > > -o <output path>", this converts Text keys into IntWritable keys
> > > (and leaves behind an index file, a mapping of Text -> IntWritable
> > > which tells you which int is assigned to which original text key).
> > >
> > > Then what you'd want to do is modify
> > > InMemoryCollapsedVariationalBayes0.loadVectors() to actually use the
> > > keys which are given to it, instead of reassigning to sequential
> > > ids.
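[Editor's note] The fix discussed above can be illustrated with a toy sketch. This is Python pseudocode for the idea only, not the actual Mahout Java; the `(doc_id, vector)` pairs are made-up stand-ins for the `<IntWritable, VectorWritable>` entries a rowid-converted SequenceFile would hold.

```python
def load_ignoring_keys(pairs):
    """Buggy behavior: rows get sequential ids in read order, so row i
    no longer corresponds to original document id i."""
    return {row: vec for row, (_, vec) in enumerate(pairs)}

def load_using_keys(pairs):
    """Fixed behavior: each row keeps the integer id carried by its key,
    as a SparseMatrix keyed by row id would."""
    return {doc_id: vec for doc_id, vec in pairs}

# SequenceFile entries are not guaranteed to arrive in key order,
# which is exactly when the two loaders disagree.
pairs = [(2, [0.1, 0.9]), (0, [0.5, 0.5]), (1, [0.7, 0.3])]

by_position = load_ignoring_keys(pairs)
by_key = load_using_keys(pairs)
```

With the key-preserving loader, the vector for document id 2 stays attached to id 2; with the positional loader it ends up under row 0.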
> > > If you make this change, we'd love to have the diff - not too many
> > > people use the cvb0_local path (it's usually used for debugging and
> > > testing smaller data sets to see that topics are converging
> > > properly), but getting it to actually produce document -> topic
> > > outputs which correlate with original docIds would be very nice! :)
> > >
> > > On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > My question is about interpreting the LDA document-topics output.
> > > >
> > > > I am using trunk.
> > > >
> > > > I have a directory of documents, each of which is named by an
> > > > integer, and there is no sub-directory under the data directory.
> > > > The directory structure is as follows:
> > > > $ ls /path/to/data/
> > > > 1
> > > > 2
> > > > 5
> > > > ...
> > > >
> > > > From those documents I want to detect topics, and output:
> > > > - topic - top terms
> > > > - document - top topics
> > > >
> > > > To this end, I first run seqdirectory on the directory:
> > > > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> > > >
> > > > Then I run seq2sparse to create tf vectors of the documents:
> > > > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector
> > > >
> > > > After creating the vectors, I run cvb0_local on those tf vectors:
> > > > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> > > >
> > > > And to interpret the results, I use Mahout's vectordump utility:
> > > > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
> > > >
> > > > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true
> > > >
> > > > The resulting words file consists of
> > > > #ofTopics lines. I assume each line is in <topicID \t wordsVector>
> > > > format, where a wordsVector is a sorted vector whose elements are
> > > > <word, score> pairs.
> > > >
> > > > The resulting docs file, on the other hand, consists of
> > > > #ofDocuments lines. I assume each line is in <documentID \t
> > > > topicsVector> format, where a topicsVector is a sorted vector
> > > > whose elements are <topicID, probability> pairs.
> > > >
> > > > The problem is that the documentID field does not match the
> > > > original document ids. This field is populated with zero-based,
> > > > auto-incrementing indices.
> > > >
> > > > I want to ask whether I am missing something needed for vectordump
> > > > to output the correct document ids, whether this is the normal
> > > > behavior when one runs LDA on a directory of documents, or whether
> > > > I made a mistake in one of those steps.
> > > >
> > > > I suspect the issue is that seqdirectory assigns Text ids to
> > > > documents, while the CVB algorithm expects documents in another
> > > > format, <IntWritable, VectorWritable>. If this is the case, could
> > > > you help me with assigning IntWritable ids to documents in the
> > > > process of creating the vectors from them? Or should I modify the
> > > > o.a.m.text.SequenceFilesFromDirectory code to do so?
> > > >
> > > > Thanks
> > > >
> > > > --
> > > > Gokhan
> > >
> > > --
> > >   -jake
> >
> > --
> > Gokhan
>
> --
>   -jake

--
Gokhan
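[Editor's note] The thread mentions that rowid leaves behind an index file mapping original Text keys to the assigned ints. Until loadVectors() preserves keys, that mapping can also be applied after the fact. The sketch below is illustrative Python only, with made-up data: it assumes the index has already been parsed into a plain dict of int id -> original document name, and that the document-topic rows from vectordump have been parsed into per-document lists of (topicID, probability) pairs.

```python
# int id -> original doc name, as recovered from rowid's index file
# (example values only; real ids come from your own data)
int_to_text = {0: "1", 1: "2", 2: "5"}

# doc int id -> [(topicID, probability), ...] parsed from the docs dump
doc_topics = {
    0: [(3, 0.6), (1, 0.2)],
    1: [(0, 0.9)],
    2: [(2, 0.5), (4, 0.3)],
}

# Re-key the rows under the original document names so the output
# correlates with the files in the input directory.
remapped = {int_to_text[i]: topics for i, topics in doc_topics.items()}
```

After remapping, each topic distribution is reported under the original document name rather than the zero-based row index.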
