Ok, I have started migrating my program to Mahout 0.6 with the new LDA version. First of all: I'm doing everything in Java code, no command-line programs.
The problem is that I can't use the sequence files I generated for the old version. I wrote them with a SequenceFile.Writer, using a Text key and a Text value. That no longer works, because CVB0Driver wants an IntWritable key. I gather that I have to use the SparseVectorsFromSequenceFiles class to convert my sequence files into the input that CVB0Driver expects. Is that correct? My problem is the lack of documentation for these classes: I don't know how to use SparseVectorsFromSequenceFiles's run method. Can someone explain how to use it? (My current attempt, plus the snippet I'm using to turn the state-n log values into probabilities per Jake's explanation below, are pasted at the bottom of this mail.)

Also, I would need a page explaining all the parameters of CVB0Driver's run method (19 parameters!! that's too many), because I don't know the meaning of some of them and I can't find any useful information :(

Thanks.

On Tue, May 8, 2012 at 1:00 PM, Jake Mannix <[email protected]> wrote:

> Hi Ivan,
>
> First off, let me say that you should probably start migrating to using
> the new LDA implementation which came in 0.6, which is invoked via the
> "mahout cvb..." command, or by directly launching the
> o.a.m.clustering.lda.cvb.CVB0Driver in your code, as the old LDA which
> you're referencing will be going away soon.
>
> But for now, I'll try to answer your questions on the old impl:
>
> On Tue, May 8, 2012 at 8:54 AM, ivan obeso <[email protected]> wrote:
>
> > I'm using Mahout 0.6. I had run the "mahout lda..." command-line tool
> > to apply LDA to a corpus. But now I want to code it in my Java program,
> > and I'm having a lot of problems because it crashes. Can someone give
> > me an example of Java code that runs correctly?
> >
> > Looking at the output of LDA, I have 2 folders:
> > - docTopics: which contains a Text key (the document ID) and a Vector
> > value (the membership of this document in each topic).
> > - state-n: I assume that the IntPairWritable is (topicID, wordID), so
> > for each topic it has as many entries as there are words in the corpus.
> > I don't know what the DoubleWritable value is. I think it is the
> > membership between the topic and the word, but I don't know what kind
> > of measure is used. For example, here is a fragment that I printed:
>
> You're correct here - the values are unnormalized log( p(wordId | topicId) )
> values. To recover probabilities, you need to exponentiate them, and
> normalize so that if you sum over all the values for a given topicId,
> the sum == 1.
>
> > ...
> > (4, 17847) -28.424714110200803
> > (4, 17848) -32.54168874531223
> > (4, 17849) -51.954687480087074
> > (4, 17850) -1.8811618929248652E-12
> > (4, 17851) -7.102634146221668
> > (4, 17852) 3.440324743165531
> > (4, 17853) 1.118778127312393
> > (4, 17854) 2.2973859313207385
> > (4, 17855) 2.1602327860824015
> > (4, 17856) -2.5362957334351677E-6
> > (4, 17857) -32.80559170476965
> > (4, 17858) -1.9791269423308222E-7
> > ...
> >
> > Can somebody help me by explaining this?
>
> --
>   -jake
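Here is my current attempt at driving seq2sparse from Java, as a minimal sketch. I'm assuming SparseVectorsFromSequenceFiles can be launched through ToolRunner like the other Mahout jobs (its own main seems to do exactly that); the paths and the flag choices below are my own guesses:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Seq2SparseDriver {
  public static void main(String[] args) throws Exception {
    String[] seq2sparseArgs = {
        "-i", "corpus/seqfiles",  // my <Text, Text> SequenceFiles, one pair per document
        "-o", "corpus/vectors",   // output dir: dictionary.file-0, tf-vectors, ...
        "-wt", "tf",              // plain term frequencies (I assume LDA doesn't want tf-idf)
        "-seq",                   // sequential-access vectors
        "-nv"                     // named vectors, so the document ids survive
    };
    ToolRunner.run(new SparseVectorsFromSequenceFiles(), seq2sparseArgs);
  }
}

If I've understood other threads correctly, the tf-vectors that come out of this are still keyed by Text, so afterwards I still have to run o.a.m.utils.vectors.RowIdJob on them to get the <IntWritable, VectorWritable> input that CVB0Driver wants. Is that the right pipeline, or does seq2sparse alone suffice?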

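And this is how I'm recovering probabilities from the state-n values, following Jake's explanation above (exponentiate the unnormalized log( p(wordId | topicId) ) values, then normalize per topic). The map comes from my own reader for the state-n SequenceFiles, which I've left out:

import java.util.HashMap;
import java.util.Map;

public class StateTopicNormalizer {
  // logProbs: wordId -> unnormalized log p(wordId | topicId), for ONE topic.
  public static Map<Integer, Double> toProbabilities(Map<Integer, Double> logProbs) {
    Map<Integer, Double> probs = new HashMap<Integer, Double>();
    double sum = 0.0;
    for (Map.Entry<Integer, Double> entry : logProbs.entrySet()) {
      double p = Math.exp(entry.getValue());   // exponentiate the log value
      probs.put(entry.getKey(), p);
      sum += p;
    }
    for (Map.Entry<Integer, Double> entry : probs.entrySet()) {
      entry.setValue(entry.getValue() / sum);  // each topic's values now sum to 1
    }
    return probs;
  }
}

I guess that for very negative log values I should subtract the maximum log value before exponentiating, so nothing underflows to zero, but for my data this seems to work.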