Sounds great Gokhan!

On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan <[email protected]> wrote:

> Jake,
>
> I converted the ids to integers with rowid, and then modified
> InMemoryCollapsedVariationBayes0.loadVectors() so that it returns a
> SparseMatrix (instead of a SparseRowMatrix) whose row ids are the keys of
> the <IntWritable, VectorWritable> tf vectors. I am not sure whether it
> works, since the mapped integer ids (the results of rowid) are in the range
> [0, #ofDocuments), but I believe it does.
>
> Constructing a SparseMatrix requires RandomAccessSparseVector row vectors,
> and the tf-vectors are sparse vectors, so I assumed that an incoming tf
> vector itself, or the result of getDelegate() if it is a NamedVector, can
> be cast to RandomAccessSparseVector.
> I will submit the diff tomorrow, so you can review and commit.
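
A note on the cast described above: it only holds if every incoming vector really is backed by a RandomAccessSparseVector. Mahout also has a SequentialAccessSparseVector, and whether one can reach this code path depends on how the tf-vectors were produced, so copying when the type does not match may be safer than casting. The Python sketch below only illustrates that unwrap-then-copy logic. NamedRow, RandomAccessRow, SequentialRow and as_random_access are hypothetical stand-ins invented for the example, not Mahout classes or methods.

```python
class NamedRow:
    """Hypothetical stand-in for a named wrapper around another vector."""

    def __init__(self, name, delegate):
        self.name = name
        self.delegate = delegate


class RandomAccessRow(dict):
    """Hypothetical stand-in for a hash-backed sparse row: {column: value}."""


class SequentialRow:
    """Hypothetical stand-in for a sparse row kept as sorted (column, value) pairs."""

    def __init__(self, pairs):
        self.pairs = sorted(pairs)

    def items(self):
        return iter(self.pairs)


def as_random_access(vector):
    """Return a RandomAccessRow for the given vector, copying only if needed."""
    # Peel off any naming wrappers to reach the underlying row.
    while isinstance(vector, NamedRow):
        vector = vector.delegate
    # Already the required type: reuse it, which is the case a cast would cover.
    if isinstance(vector, RandomAccessRow):
        return vector
    # Any other sparse layout: copy the non-zero entries instead of casting.
    return RandomAccessRow(vector.items())
```

With this shape, a row that is already random-access is passed through untouched, and any other sparse row costs one copy of its non-zero entries instead of a failed cast at run time.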
>
> Thank you for your help.
>
>
> On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[email protected]> wrote:
>
> > Hi Gokhan,
> >
> >   This looks like a bug in the
> > InMemoryCollapsedVariationBayes0.loadVectors() method - it takes the
> > SequenceFile<? extends Writable, VectorWritable> and ignores the keys,
> > assigning the rows in order into an in-memory Matrix.
> >
> >   If you run
> > "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>"
> > this converts Text keys into IntWritable keys (and leaves behind an index
> > file, a mapping of Text -> IntWritable which tells you which int is
> > assigned to which original text key).
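
The re-keying described above can be pictured with a small Python sketch. It is only an illustration of the idea (sequential int ids plus a saved mapping back to the original keys), not the rowid implementation. The function names, the in-memory dict used for the index, and the sample keys "1", "2", "5" are assumptions made for the example.

```python
def assign_row_ids(keyed_vectors):
    """Re-key (text_key, vector) pairs with sequential ints and keep an index.

    Returns (rows, index): rows is a list of (int_id, vector) pairs, and
    index maps each original text key to the int id it was given.
    """
    rows = []
    index = {}
    for int_id, (text_key, vector) in enumerate(keyed_vectors):
        rows.append((int_id, vector))
        index[text_key] = int_id
    return rows, index


def original_key(index, int_id):
    """Look up which original text key was assigned the given int id."""
    inverse = {assigned: key for key, assigned in index.items()}
    return inverse[int_id]


# Documents named 1, 2 and 5 become rows 0, 1 and 2.
rows, index = assign_row_ids([("1", {0: 2.0}), ("2", {4: 1.0}), ("5", {0: 1.0})])
```

Here original_key(index, 2) gives back "5", which is the lookup needed to turn a row number in the document -> topic output into the original document name.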
> >
> >   Then what you'd want to do is modify
> > InMemoryCollapsedVariationBayes0.loadVectors() to actually use the keys
> > which are given to it, instead of reassigning to sequential ids.  If you
> > make this change, we'd love to have the diff - not too many people use
> > the cvb0_local path (it's usually used for debugging and testing smaller
> > data sets to see that topics are converging properly), but getting it to
> > actually produce document -> topic outputs which correlate with original
> > docIds would be very nice! :)
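
To make the difference concrete, here is a minimal Python sketch of the two loading strategies: numbering rows by arrival order versus using the int key that comes with each vector. It is not the Mahout code. The function names are hypothetical, and the dicts stand in for an in-memory matrix with sparse row ids.

```python
def load_rows_sequential(pairs):
    """Behaviour described as the bug: keys are dropped and rows are
    numbered by the order in which the pairs are read."""
    return {row: vector for row, (_key, vector) in enumerate(pairs)}


def load_rows_keyed(pairs):
    """Proposed behaviour: the int key of each pair becomes its row id."""
    rows = {}
    for key, vector in pairs:
        if key in rows:
            raise ValueError(f"duplicate row id {key}")
        rows[key] = vector
    return rows


# Pairs read in an order that differs from their keys.
pairs = [(2, "vec-for-2"), (0, "vec-for-0"), (1, "vec-for-1")]
by_order = load_rows_sequential(pairs)  # row 0 holds "vec-for-2"
by_key = load_rows_keyed(pairs)         # row 2 holds "vec-for-2"
```

The two strategies agree only when the pairs happen to arrive in key order starting at 0, so a test input that is deliberately out of order is one way to check that a modified loader really uses the keys.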
> >
> > On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > My question is about interpreting lda document-topics output.
> > >
> > > I am using trunk.
> > >
> > > I have a directory of documents, each of which is named by an integer,
> > > and there is no sub-directory of the data directory.
> > > The directory structure is as follows:
> > > $ ls /path/to/data/
> > >    1
> > >    2
> > >    5
> > >    ...
> > >
> > > From those documents I want to detect topics, and output:
> > > - topic - top terms
> > > - document - top topics
> > >
> > > To this end, I first run seqdirectory on the directory:
> > > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> > >
> > > Then I run seq2sparse to create tf vectors of documents:
> > > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3
> > > --namedVector
> > >
> > > After creating vectors, I run cvb0_local on those tf-vectors:
> > > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to
> > > $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> > >
> > > And to interpret the results, I use mahout's vectordump utility:
> > > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10
> > > -sort true -p true
> > >
> > > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary
> > > $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10
> > > -sort true -p true
> > >
> > > The resulting words file consists of #ofTopics lines.
> > > I assume each line is in <topicID \t wordsVector> format, where a
> > > wordsVector is a sorted vector whose elements are <word, score> pairs.
> > >
> > > The resulting docs file, on the other hand, consists of #ofDocuments
> > > lines. I assume each line is in <documentID \t topicsVector> format,
> > > where a topicsVector is a sorted vector whose elements are
> > > <topicID, probability> pairs.
> > >
> > > The problem is that the documentID field does not match the original
> > > document ids. This field is populated with zero-based auto-incrementing
> > > indices.
> > >
> > > I want to ask whether I am missing something needed for vectordump to
> > > output the correct document ids, whether this is the normal behavior when
> > > one runs lda on a directory of documents, or whether I made a mistake in
> > > one of those steps.
> > >
> > > I suspect the issue is that seqdirectory assigns Text ids to documents,
> > > while the CVB algorithm expects documents in another format,
> > > <IntWritable, VectorWritable>. If this is the case, could you help me
> > > assign IntWritable ids to documents in the process of creating vectors
> > > from them? Or should I modify the o.a.m.text.SequenceFilesFromDirectory
> > > code to do so?
> > >
> > > Thanks
> > >
> > > --
> > > Gokhan
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>
>
>
> --
> Gokhan
>



-- 

  -jake
