Hi Dan,

I've managed to get the text_seq and text_vec generated properly; however, when I run:
$MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors -o /user/root/text_lda -k 100 -nt 29536 -x 20 -dict /user/root/text_vec/dictionary.file-0 -dt /user/root/text_cvb_document -mt /user/root/text_states

I get:

12/08/05 21:18:04 INFO mapred.JobClient: Task Id : attempt_201208051752_0002_m_000003_1, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
	at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)
Task attempt_201208051752_0002_m_000003_1 failed to report status for 600 seconds. Killing!

Any ideas what's causing this? Thank you for all the help so far!

Kind Regards,
Folcon

On 2 August 2012 02:41, Folcon Red <[email protected]> wrote:
> Thanks Dan,
>
> Ok, now for some strange reason it appears to be working (seq and vec
> appear to have values now; I'll test the complete cvb later, I should
> head to bed...). The only things I think I changed were that I stopped
> using absolute paths (referring to text_seq as opposed to
> /user/root/text_seq) and that I'm using root now instead of sgeadmin.
>
> Regards,
> Folcon
>
>
> On 1 August 2012 03:00, DAN HELM <[email protected]> wrote:
>
>> Hi Folcon,
>>
>> There is no reason to rerun seq2sparse, as it is clear something is wrong
>> with the text files being processed by the seqdirectory command.
>>
>> Based on the keys, I'm assuming the full paths to the input files are
>> names like /high/59734, etc.
>> Did you look inside the files to make sure there is text in them?
>>
>> As a test, just create a folder with a simple text file and run that
>> through seqdirectory, and I'll bet you will then see output from the
>> seqdumper command (run on the seqdirectory output).
>>
>> Thanks, Dan
>>
>> *From:* Folcon Red <[email protected]>
>> *To:* DAN HELM <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>
>> *Sent:* Tuesday, July 31, 2012 7:28 PM
>>
>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>>
>> Hi Dan,
>>
>> It's good to know that seqdirectory reads files in subfolders. I've
>> dumped out some of the values in the hope that they will be enlightening:
>> the values seem to be missing for both text_seq and the
>> tokenized-documents.
>>
>> So, rerunning some of the commands:
>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> And then doing a seqdumper of text_seq:
>> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> [...]
>> Key: /high/59734: Value:
>> Key: /high/264596: Value:
>> Key: /high/341699: Value:
>> Key: /high/260770: Value:
>> Key: /high/222320: Value:
>> Key: /high/198156: Value:
>> Key: /high/326011: Value:
>> Key: /high/112050: Value:
>> Key: /high/306887: Value:
>> Key: /high/208169: Value:
>> Key: /high/283464: Value:
>> Key: /high/168905: Value:
>> Count: 2548
>>
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
>> [...]
>> Key: /high/396063: Value: []
>> Key: /high/230246: Value: []
>> Key: /high/136284: Value: []
>> Key: /high/59734: Value: []
>> Key: /high/264596: Value: []
>> Key: /high/341699: Value: []
>> Key: /high/260770: Value: []
>> Key: /high/222320: Value: []
>> Key: /high/198156: Value: []
>> Key: /high/326011: Value: []
>> Key: /high/112050: Value: []
>> Key: /high/306887: Value: []
>> Key: /high/208169: Value: []
>> Key: /high/283464: Value: []
>> Key: /high/168905: Value: []
>> Count: 2548
>>
>> Running vectordump on the text_vec folder like so:
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i /user/sgeadmin/text_vec
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec], --startPhase=[0], --tempDir=[temp]}
>> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
>> Exception in thread "main" java.lang.IllegalStateException: file:/user/sgeadmin/text_vec/tf-vectors
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
>> 	at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
>> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> 	at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:616)
>> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:616)
>> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Caused by: java.io.FileNotFoundException: /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
>> 	at java.io.FileInputStream.open(Native Method)
>> 	at java.io.FileInputStream.<init>(FileInputStream.java:137)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
>> 	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
>> 	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
>> 	... 15 more
>>
>> Kind Regards,
>> Nilu
>>
>> On 31 July 2012 23:59, DAN HELM <[email protected]> wrote:
>>
>> > Folcon,
>> >
>> > seqdirectory should also read files in subfolders.
>> >
>> > Did you verify that the recent seqdirectory command did in fact generate
>> > non-empty sequence files? I believe the seqdirectory command just assumes
>> > each file contains a single document (no concatenated documents per
>> > file), and that each file contains basic text.
>> >
>> > If it did generate sequence files this time, I assume your folder
>> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there) before
>> > you ran seq2sparse on it?
>> >
>> > Dan
>> >
>> > *From:* Folcon Red <[email protected]>
>> > *To:* DAN HELM <[email protected]>
>> > *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>> >
>> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> >
>> > So part-r-00000 inside text_vec is still
>> > SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
>> > even after moving all the training files into a single folder.
>> > Regards,
>> > Folcon
>> >
>> > On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:
>> >
>> > > Hey Everyone,
>> > >
>> > > Ok, not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > > didn't produce sequence files; just looking inside text_seq only gives me:
>> > >
>> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> > >
>> > > and that's it. Any ideas what I've been doing wrong? Maybe it's because I
>> > > have the files nested in the folder by class; for example, a tree view of
>> > > the directory would look like:
>> > >
>> > > text_train -+
>> > >             | A -+
>> > >             |     100
>> > >             |     101
>> > >             |     103
>> > >             | B -+
>> > >             |     102
>> > >             |     105
>> > >             |     106
>> > >
>> > > So it's not picking them up? Or perhaps something else? I'm going to try
>> > > some variations to see what happens.
>> > >
>> > > Thanks for the help so far!
>> > >
>> > > Regards,
>> > > Folcon
>> > >
>> > >
>> > > On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
>> > >
>> > >> Right, well here's something promising; running
>> > >> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>> > >>
>> > >> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>> > >>
>> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>> > >>
>> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> > >> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> > >> Count: 0
>> > >>
>> > >> Kind Regards,
>> > >> Folcon
>> > >>
>> > >> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
>> > >>
>> > >>> Yep, something went wrong, most likely with the clustering. The part
>> > >>> file is empty. It should look something like this:
>> > >>>
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>> > >>> Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>> > >>> Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>> > >>> Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>> > >>> ...
>> > >>> ...
>> > >>>
>> > >>> The Key refers to a document id, and the Value is the topic id:weight
>> > >>> pairs assigned to that document id.
>> > >>>
>> > >>> So you need to figure out where things went wrong. I assume folder
>> > >>> /user/sgeadmin/text_lda also has empty part files? Assuming part
>> > >>> files are there, run seqdumper on one.
>> > >>> It should have data like the above, except in this case the key will be
>> > >>> a topic id and the vector will be term ids:weights.
>> > >>>
>> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>> > >>> sure sparse vectors were generated for your input to cvb.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>> *From:* Folcon Red <[email protected]>
>> > >>> *To:* DAN HELM <[email protected]>
>> > >>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
>> > >>>
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Thanks Dan and Jake,
>> > >>>
>> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:
>> > >>>
>> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>> > >>> Count: 0
>> > >>>
>> > >>> I'm not certain what went wrong.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>> > >>>
>> > >>> Folcon,
>> > >>>
>> > >>> I'm still using Mahout 0.6, so I don't know much about changes in 0.7.
>> > >>>
>> > >>> Your output folder for "dt" looks correct. The relevant data would be
>> > >>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>> > >>> be passing to a "-s" option. But I see it says the size is only 97, so
>> > >>> that looks suspicious. So you can just view the file (for starters) as:
>> > >>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.
>> > >>> And the vector dumper command (as Jake pointed out) has a lot more
>> > >>> options to post-process the data, but you may want to first just see
>> > >>> what is in that file.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>> *From:* Folcon Red <[email protected]>
>> > >>> *To:* Jake Mannix <[email protected]>
>> > >>> *Cc:* [email protected]; DAN HELM <[email protected]>
>> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Guys,
>> > >>>
>> > >>> Thanks for replying. The problem is whenever I use any -s flag I get
>> > >>> the error "Unexpected -s while processing Job-Specific Options:"
>> > >>>
>> > >>> Also, I'm not sure if this is supposed to be the output of -dt:
>> > >>>
>> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>> > >>> Found 3 items
>> > >>> -rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>> > >>> drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>> > >>> -rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>>
>> > >>> Should I be using a newer version of mahout? I've just been using the
>> > >>> 0.7 distribution so far, as apparently the compiled versions are
>> > >>> missing parts that the distributed ones have.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> PS: Thanks for the help so far!
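[Editor's note: the "Unexpected -s while processing Job-Specific Options" error above looks like a 0.6-vs-0.7 command-line difference; elsewhere in this thread the same kind of file is dumped successfully with -i/--input rather than -s. A hedged sketch of the 0.7-style invocation, reusing the path from the message above:]

```shell
# Sketch, assuming the Mahout 0.7 distribution used in this thread:
# seqdumper there takes its input via -i/--input, so the -s form from the
# earlier (0.6-era) advice is rejected with "Unexpected -s".
$MAHOUT_HOME/bin/mahout seqdumper \
  -i /user/sgeadmin/text_cvb_document/part-m-00000
```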
>> > >>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>> > >>>
>> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>> > >>>
>> > >>> Hi Folcon,
>> > >>>
>> > >>> In the folder you specified for the -dt option of the cvb command,
>> > >>> there should be sequence files with the document-to-topic associations
>> > >>> (Key: IntWritable, Value: VectorWritable).
>> > >>>
>> > >>> Yeah, this is correct, although this:
>> > >>>
>> > >>> You can dump in text format as: mahout seqdumper -s <sequence file>
>> > >>>
>> > >>> is not as good as using vectordumper:
>> > >>>
>> > >>> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>> > >>>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>> > >>>
>> > >>> This joins your topic vectors with the dictionary, then picks out the
>> > >>> top k terms (with their probabilities) for each topic and prints them
>> > >>> to the console (or to the file you specify with an --output option).
>> > >>>
>> > >>> *although* I notice now that in trunk, when I just checked,
>> > >>> VectorDumper.java had a bug in it for "vectorSize" - line 175 asks for
>> > >>> the cmdline option "numIndexesPerVector", not vectorSize, ack! So I
>> > >>> took the liberty of fixing that, but you'll need to "svn up" and
>> > >>> rebuild your jar before using vectordump like this.
>> > >>>
>> > >>> So in the text output from seqdumper, the key is a document id and the
>> > >>> vector contains the topics and scores associated with that document.
>> > >>> I think all topics are listed for each document, but many with a
>> > >>> near-zero score.
>> > >>> In my case I used rowid to convert the keys of the original sparse
>> > >>> document vectors from Text to Integer before running cvb; this
>> > >>> generates a mapping file, so I know the textual keys that correspond
>> > >>> to the numeric document ids (since my original document ids were file
>> > >>> names and I created named vectors).
>> > >>>
>> > >>> Hope this helps.
>> > >>> Dan
>> > >>>
>> > >>> ________________________________
>> > >>>
>> > >>> From: Folcon <[email protected]>
>> > >>> To: [email protected]
>> > >>> Sent: Saturday, July 28, 2012 8:28 PM
>> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Everyone,
>> > >>>
>> > >>> I'm posting this as my original message did not seem to appear on the
>> > >>> mailing list; I'm very sorry if I have done this in error.
>> > >>>
>> > >>> I'm doing this to then use the topics to train a maxent algorithm to
>> > >>> predict the classes of documents given their topic mixtures. Any
>> > >>> further aid in this direction would be appreciated!
>> > >>>
>> > >>> I've been trying to extract the topics out of my run of cvb. Here's
>> > >>> what I did so far.
>> > >>>
>> > >>> Ok, so I still don't know how to output the topics, but I have worked
>> > >>> out how to run cvb and get what I think are the document vectors;
>> > >>> however, I'm not having any luck dumping them, so help here would
>> > >>> still be appreciated!
>> > >>>
>> > >>> I set the values of:
>> > >>> export MAHOUT_HOME=/home/sgeadmin/mahout
>> > >>> export HADOOP_HOME=/usr/lib/hadoop
>> > >>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>> > >>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> > >>> on the master; otherwise none of this works.
>> > >>> So first I uploaded the documents using starcluster's put:
>> > >>> starcluster put mycluster text_train /home/sgeadmin/
>> > >>> starcluster put mycluster text_test /home/sgeadmin/
>> > >>>
>> > >>> Then I added them to Hadoop's HDFS filesystem:
>> > >>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>> > >>>
>> > >>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>> > >>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > >>>
>> > >>> Then I called Mahout's seq2sparse to turn them into vectors:
>> > >>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>> > >>>
>> > >>> Finally I called cvb. I believe the -dt flag states where the inferred
>> > >>> topics should go, but because I haven't yet been able to dump them I
>> > >>> can't confirm this:
>> > >>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>> > >>>
>> > >>> The -k flag is the number of topics, and -nt is the size of the
>> > >>> dictionary; I computed this by counting the number of entries in
>> > >>> dictionary.file-0 inside the vectors folder (in this case under
>> > >>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>> > >>>
>> > >>> If you know how to get the document topic probabilities from here,
>> > >>> help would be most appreciated!
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> --
>> > >>>   -jake
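[Editor's note: the ClassCastException at the top of the thread (Text keys where CachingCVB0Mapper expects IntWritable) lines up with Dan's remark above about running rowid to convert the keys of the sparse document vectors from Text to Integer before cvb. A hedged sketch of that step, reusing the thread's paths; the matrix/docIndex output names are what Mahout 0.7's rowid job is expected to write, so verify against your version:]

```shell
# Convert the Text-keyed tf-vectors to the IntWritable keys cvb expects.
# rowid writes two files under the output dir: 'matrix' (IntWritable ->
# VectorWritable) and 'docIndex' (IntWritable -> Text), the latter mapping
# numeric ids back to the original document keys.
$MAHOUT_HOME/bin/mahout rowid \
  -i /user/root/text_vec/tf-vectors \
  -o /user/root/text_vec/matrix

# Then point cvb at the converted matrix instead of tf-vectors:
$MAHOUT_HOME/bin/mahout cvb \
  -i /user/root/text_vec/matrix/matrix \
  -o /user/root/text_lda -k 100 -nt 29536 -x 20 \
  -dict /user/root/text_vec/dictionary.file-0 \
  -dt /user/root/text_cvb_document \
  -mt /user/root/text_states
```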
