Sounds great Gokhan!

On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan <[email protected]> wrote:
> Jake,
>
> I converted the ids to integers with rowid, and then modified
> InMemoryCollapsedVariationBayes0.loadVectors() such that it returns a
> SparseMatrix (instead of SparseRowMatrix) whose row ids are keys from
> <IntWritable, VectorWritable> tf vectors. I am not sure if it works,
> since the values of mapped integer ids (results of rowid) are in the
> range [0, #ofDocuments), but I believe it does.
>
> Constructing SparseMatrix needs RandomAccessSparseVector as row vectors
> and tf-vectors are sparse vectors, so I assumed that an incoming tf
> vector itself, or getDelegate if it is a NamedVector, can be cast to
> RandomAccessSparseVector.
> I will submit the diff tomorrow, so you can review and commit.
>
> Thank you for your help.
>
>
> On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[email protected]> wrote:
>
> > Hi Gokhan,
> >
> > This looks like a bug in the
> > InMemoryCollapsedVariationBayes0.loadVectors() method - it takes the
> > SequenceFile<? extends Writable, VectorWritable> and ignores the keys,
> > assigning the rows in order into an in-memory Matrix.
> >
> > If you run
> > "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>"
> > this converts Text keys into IntWritable keys (and leaves behind an
> > index file, a mapping of Text -> IntWritable which tells you which int
> > is assigned to which original text key).
> >
> > Then what you'd want to do is modify
> > InMemoryCollapsedVariationBayes0.loadVectors() to actually use the keys
> > which are given to it, instead of reassigning to sequential ids. If you
> > make this change, we'd love to have the diff - not too many people use
> > the cvb0_local path (it's usually used for debugging and testing smaller
> > data sets to see that topics are converging properly), but getting it to
> > actually produce document -> topic outputs which correlate with original
> > docIds would be very nice!
> > :)
> >
> > On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > My question is about interpreting lda document-topics output.
> > >
> > > I am using trunk.
> > >
> > > I have a directory of documents, each of which are named by integers,
> > > and there is no sub-directory of the data directory.
> > > The directory structure is as follows
> > > $ ls /path/to/data/
> > > 1
> > > 2
> > > 5
> > > ...
> > >
> > > From those documents I want to detect topics, and output:
> > > - topic - top terms
> > > - document - top topics
> > >
> > > To this end, I first run seqdirectory on the directory:
> > > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> > >
> > > Then I run seq2sparse to create tf vectors of documents:
> > > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector
> > >
> > > After creating vectors, I run cvb0_local on those tf-vectors:
> > > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> > >
> > > And to interpret the results, I use mahout's vectordump utility:
> > > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
> > >
> > > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true
> > >
> > > The resulting words file consists of #ofTopics lines.
> > > I assume each line is in <topicID \t wordsVector> format, where a
> > > wordsVector is a sorted vector whose elements are <word, score> pairs.
> > >
> > > The resulting docs file on the other hand, consists of #ofDocuments
> > > lines.
> > > I assume each line is in <documentID \t topicsVector> format, where a
> > > topicsVector is a sorted vector whose elements are <topicID,
> > > probability> pairs.
> > >
> > > The problem is that the documentID field does not match with the
> > > original document ids. This field is populated with zero-based
> > > auto-incrementing indices.
> > >
> > > I want to ask if I am missing something for vectordump to output
> > > correct document ids, or this is the normal behavior when one runs lda
> > > on a directory of documents, or I make a mistake in one of those steps.
> > >
> > > I suspect the issue is seqdirectory assigns Text ids to documents,
> > > while CVB algorithm expects documents in another format, <IntWritable,
> > > VectorWritable>. If this is the case, could you help me for assigning
> > > IntWritable ids to documents in the process of creating vectors from
> > > them? Or should I modify the o.a.m.text.SequenceFilesFromDirectory code
> > > to do so?
> > >
> > > Thanks
> > >
> > > --
> > > Gokhan
> > >
> >
> > --
> >
> >   -jake
> >
>
> --
> Gokhan
>

--

  -jake
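
The difference between the two loading strategies discussed in this thread
can be sketched in plain Java. This is only an illustration under stated
assumptions, not the Mahout code or the submitted diff: the class
KeyedRowLoading, the Row type, and both method names are hypothetical, and
ordinary Java collections stand in for the SequenceFile records and the
in-memory Matrix. The keys 1, 2, 5 are made up to echo the directory listing
in the original question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain collections stand in for Mahout's
// <IntWritable, VectorWritable> records and its Matrix classes.
final class KeyedRowLoading {

    // Hypothetical stand-in for one keyed tf-vector record.
    static final class Row {
        final int key;
        final double[] termCounts;

        Row(int key, double[] termCounts) {
            this.key = key;
            this.termCounts = termCounts;
        }
    }

    // Behaviour described as the bug: the incoming keys are dropped and
    // rows are numbered 0, 1, 2, ... in the order they are read.
    static Map<Integer, double[]> loadIgnoringKeys(List<Row> rows) {
        Map<Integer, double[]> matrix = new LinkedHashMap<>();
        int nextRowId = 0;
        for (Row row : rows) {
            matrix.put(nextRowId++, row.termCounts);
        }
        return matrix;
    }

    // Proposed behaviour: each row is stored under the key it arrived with.
    static Map<Integer, double[]> loadKeepingKeys(List<Row> rows) {
        Map<Integer, double[]> matrix = new LinkedHashMap<>();
        for (Row row : rows) {
            matrix.put(row.key, row.termCounts);
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1, new double[] {2.0, 0.0}),
            new Row(2, new double[] {0.0, 1.0}),
            new Row(5, new double[] {1.0, 3.0}));

        System.out.println(loadIgnoringKeys(rows).keySet()); // [0, 1, 2]
        System.out.println(loadKeepingKeys(rows).keySet());  // [1, 2, 5]
    }
}
```

With the first method, row ids in the output no longer say which document a
row came from. With the second, the ids that went in are the ids that come
out, which is what lets them be mapped back to the original documents through
the index file that rowid leaves behind.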
