Hi Dan,

I've managed to get the text_seq and text_vec generated properly; however, when I run:
$MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors -o /user/root/text_lda -k 100 -nt 29536 -x 20 -dict /user/root/text_vec/dictionary.file-0 -dt /user/root/text_cvb_document -mt /user/root/text_states

I get:

12/08/05 21:18:04 INFO mapred.JobClient: Task Id : attempt_201208051752_0002_m_000003_1, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
	at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
	at org.apache.hadoop.mapred.Child.main(Child.java:264)
Task attempt_201208051752_0002_m_000003_1 failed to report status for 600 seconds. Killing!

Any ideas what's causing this? Thank you for all the help so far!

Kind Regards,
Folcon

On 2 August 2012 02:41, Folcon Red <[email protected]> wrote:
> Thanks Dan,
>
> Ok, now for some strange reason it appears to be working (seq and vec
> appear to have values now; I'll test the complete cvb later, I should
> head to bed...). The only things I think I changed were that I stopped
> using absolute paths (referring to text_seq as opposed to
> /user/root/text_seq) and that I'm using root now instead of sgeadmin.
>
> Regards,
> Folcon
>
>
> On 1 August 2012 03:00, DAN HELM <[email protected]> wrote:
>
>> Hi Folcon,
>>
>> There is no reason to rerun seq2sparse, as it is clear something is wrong
>> with the text files being processed by the seqdirectory command.
>>
>> Based on the keys, I'm assuming the full paths to the input files are
>> names like /high/59734, etc.
>> Did you look inside the files to make sure there is text in them?
>>
>> As a test, just create a folder with a simple text file and run that
>> through seqdirectory, and I'll bet you will then see output from the
>> seqdumper command (run on the seqdirectory output).
>>
>> Thanks, Dan
>>
>> *From:* Folcon Red <[email protected]>
>> *To:* DAN HELM <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>
>> *Sent:* Tuesday, July 31, 2012 7:28 PM
>>
>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>>
>> Hi Dan,
>>
>> It's good to know that seqdirectory reads files in subfolders. I've
>> dumped out some of the values in the hope that they will be enlightening:
>> the values seem to be missing for both text_seq and the
>> tokenized-documents.
>>
>> So, rerunning some of the commands:
>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> And then doing a seqdumper of text_seq:
>> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> [...]
>> Key: /high/59734: Value:
>> Key: /high/264596: Value:
>> Key: /high/341699: Value:
>> Key: /high/260770: Value:
>> Key: /high/222320: Value:
>> Key: /high/198156: Value:
>> Key: /high/326011: Value:
>> Key: /high/112050: Value:
>> Key: /high/306887: Value:
>> Key: /high/208169: Value:
>> Key: /high/283464: Value:
>> Key: /high/168905: Value:
>> Count: 2548
>>
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
>> [...]
>> Key: /high/396063: Value: []
>> Key: /high/230246: Value: []
>> Key: /high/136284: Value: []
>> Key: /high/59734: Value: []
>> Key: /high/264596: Value: []
>> Key: /high/341699: Value: []
>> Key: /high/260770: Value: []
>> Key: /high/222320: Value: []
>> Key: /high/198156: Value: []
>> Key: /high/326011: Value: []
>> Key: /high/112050: Value: []
>> Key: /high/306887: Value: []
>> Key: /high/208169: Value: []
>> Key: /high/283464: Value: []
>> Key: /high/168905: Value: []
>> Count: 2548
>>
>> Running vectordump on the text_vec folder like so:
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i /user/sgeadmin/text_vec
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec], --startPhase=[0], --tempDir=[temp]}
>> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
>> Exception in thread "main" java.lang.IllegalStateException: file:/user/sgeadmin/text_vec/tf-vectors
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
>> 	at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
>> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> 	at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:616)
>> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 	at java.lang.reflect.Method.invoke(Method.java:616)
>> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Caused by: java.io.FileNotFoundException: /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
>> 	at java.io.FileInputStream.open(Native Method)
>> 	at java.io.FileInputStream.<init>(FileInputStream.java:137)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
>> 	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
>> 	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
>> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
>> 	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
>> 	... 15 more
>>
>> Kind Regards,
>> Nilu
>>
>> On 31 July 2012 23:59, DAN HELM <[email protected]> wrote:
>>
>> > Folcon,
>> >
>> > seqdirectory should also read files in subfolders.
>> >
>> > Did you verify that the recent seqdirectory command did in fact generate
>> > non-empty sequence files? I believe the seqdirectory command just assumes
>> > each file contains a single document (no concatenated documents per
>> > file), and that each file contains basic text.
>> >
>> > If it did generate sequence files this time, I assume your folder
>> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there) before
>> > you ran seq2sparse on it?
>> >
>> > Dan
>> >
>> > *From:* Folcon Red <[email protected]>
>> > *To:* DAN HELM <[email protected]>
>> > *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>> >
>> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> >
>> > So part-r-00000 inside text_vec is still
>> > SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
>> > even after moving all the training files into a single folder.
>> > Regards,
>> > Folcon
>> >
>> > On 31 July 2012 18:18, Folcon Red <[email protected]> wrote:
>> >
>> > > Hey Everyone,
>> > >
>> > > Ok, not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > > didn't produce sequence files; just looking inside text_seq only gives me:
>> > >
>> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> > >
>> > > and that's it. Any ideas what I've been doing wrong? Maybe it's because I
>> > > have the files nested in the folder by class; for example, a tree view of
>> > > the directory would look like:
>> > >
>> > > text_train -+
>> > >             | A -+
>> > >             |     100
>> > >             |     101
>> > >             |     103
>> > >             | B -+
>> > >             |     102
>> > >             |     105
>> > >             |     106
>> > >
>> > > So it's not picking them up? Or perhaps something else? I'm going to try
>> > > some variations to see what happens.
>> > >
>> > > Thanks for the help so far!
>> > >
>> > > Regards,
>> > > Folcon
>> > >
>> > >
>> > > On 29 July 2012 22:10, Folcon Red <[email protected]> wrote:
>> > >
>> > >> Right, well here's something promising; running
>> > >> $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>> > >>
>> > >> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>> > >>
>> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>> > >>
>> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> > >> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> > >> Count: 0
>> > >>
>> > >> Kind Regards,
>> > >> Folcon
>> > >>
>> > >> On 29 July 2012 21:29, DAN HELM <[email protected]> wrote:
>> > >>
>> > >>> Yep, something went wrong, most likely with the clustering. The part
>> > >>> file is empty. It should look something like this:
>> > >>>
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>> > >>> Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>> > >>> Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>> > >>> Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>> > >>> ...
>> > >>> ...
>> > >>>
>> > >>> The Key refers to a document id, and the Value is the topic id:weight
>> > >>> pairs assigned to that document id.
>> > >>>
>> > >>> So you need to figure out where things went wrong. I assume folder
>> > >>> /user/sgeadmin/text_lda also has empty part files? Assuming part
>> > >>> files are there, run seqdumper on one.
>> > >>> It should have data like the above, except in this case the key will be
>> > >>> a topic id and the vector will be term ids:weights.
>> > >>>
>> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>> > >>> sure sparse vectors were generated for your input to cvb.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>> *From:* Folcon Red <[email protected]>
>> > >>> *To:* DAN HELM <[email protected]>
>> > >>> *Cc:* Jake Mannix <[email protected]>; "[email protected]" <[email protected]>
>> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
>> > >>>
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Thanks Dan and Jake,
>> > >>>
>> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:
>> > >>>
>> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
>> > >>> Count: 0
>> > >>>
>> > >>> I'm not certain what went wrong.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> On 29 July 2012 18:49, DAN HELM <[email protected]> wrote:
>> > >>>
>> > >>> Folcon,
>> > >>>
>> > >>> I'm still using Mahout 0.6, so I don't know much about changes in 0.7.
>> > >>>
>> > >>> Your output folder for "dt" looks correct. The relevant data would be
>> > >>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>> > >>> be passing to a "-s" option. But I see it says the size is only 97, so
>> > >>> that looks suspicious. So you can just view the file (for starters) as:
>> > >>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.
>> > >>> And the vector dumper command (as Jake pointed out) has a lot more
>> > >>> options to post-process the data, but you may want to first just see
>> > >>> what is in that file.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>> *From:* Folcon Red <[email protected]>
>> > >>> *To:* Jake Mannix <[email protected]>
>> > >>> *Cc:* [email protected]; DAN HELM <[email protected]>
>> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Guys,
>> > >>>
>> > >>> Thanks for replying. The problem is whenever I use any -s flag I get
>> > >>> the error "Unexpected -s while processing Job-Specific Options:"
>> > >>>
>> > >>> Also, I'm not sure if this is supposed to be the output of -dt:
>> > >>>
>> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>> > >>> Found 3 items
>> > >>> -rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>> > >>> drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>> > >>> -rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>>
>> > >>> Should I be using a newer version of mahout? I've just been using the
>> > >>> 0.7 distribution so far, as apparently the compiled versions are
>> > >>> missing parts that the distributed ones have.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> PS: Thanks for the help so far!
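[Editor's note: the "Unexpected -s while processing Job-Specific Options" error above looks like a 0.6-vs-0.7 command-line difference; elsewhere in this thread the same kind of file is dumped successfully with -i/--input rather than -s. A hedged sketch of the 0.7-style invocation, reusing the path from the message above:]

```shell
# Sketch, assuming the Mahout 0.7 distribution used in this thread:
# seqdumper there takes its input via -i/--input, so the -s form from the
# earlier (0.6-era) advice is rejected with "Unexpected -s".
$MAHOUT_HOME/bin/mahout seqdumper \
  -i /user/sgeadmin/text_cvb_document/part-m-00000
```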
>> > >>> On 29 July 2012 04:52, Jake Mannix <[email protected]> wrote:
>> > >>>
>> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[email protected]> wrote:
>> > >>>
>> > >>> Hi Folcon,
>> > >>>
>> > >>> In the folder you specified for the -dt option of the cvb command,
>> > >>> there should be sequence files with the document-to-topic associations
>> > >>> (Key: IntWritable, Value: VectorWritable).
>> > >>>
>> > >>> Yeah, this is correct, although this:
>> > >>>
>> > >>> You can dump in text format as: mahout seqdumper -s <sequence file>
>> > >>>
>> > >>> is not as good as using vectordumper:
>> > >>>
>> > >>> mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>> > >>>     --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>> > >>>
>> > >>> This joins your topic vectors with the dictionary, then picks out the
>> > >>> top k terms (with their probabilities) for each topic and prints them
>> > >>> to the console (or to the file you specify with an --output option).
>> > >>>
>> > >>> *although* I notice now that in trunk, when I just checked,
>> > >>> VectorDumper.java had a bug in it for "vectorSize" - line 175 asks for
>> > >>> the cmdline option "numIndexesPerVector", not vectorSize, ack! So I
>> > >>> took the liberty of fixing that, but you'll need to "svn up" and
>> > >>> rebuild your jar before using vectordump like this.
>> > >>>
>> > >>> So in the text output from seqdumper, the key is a document id and the
>> > >>> vector contains the topics and scores associated with that document.
>> > >>> I think all topics are listed for each document, but many with a
>> > >>> near-zero score.
>> > >>> In my case I used rowid to convert the keys of the original sparse
>> > >>> document vectors from Text to Integer before running cvb; this
>> > >>> generates a mapping file, so I know the textual keys that correspond
>> > >>> to the numeric document ids (since my original document ids were file
>> > >>> names and I created named vectors).
>> > >>>
>> > >>> Hope this helps.
>> > >>> Dan
>> > >>>
>> > >>> ________________________________
>> > >>>
>> > >>> From: Folcon <[email protected]>
>> > >>> To: [email protected]
>> > >>> Sent: Saturday, July 28, 2012 8:28 PM
>> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Everyone,
>> > >>>
>> > >>> I'm posting this as my original message did not seem to appear on the
>> > >>> mailing list; I'm very sorry if I have done this in error.
>> > >>>
>> > >>> I'm doing this to then use the topics to train a maxent algorithm to
>> > >>> predict the classes of documents given their topic mixtures. Any
>> > >>> further aid in this direction would be appreciated!
>> > >>>
>> > >>> I've been trying to extract the topics out of my run of cvb. Here's
>> > >>> what I did so far.
>> > >>>
>> > >>> Ok, so I still don't know how to output the topics, but I have worked
>> > >>> out how to run cvb and get what I think are the document vectors;
>> > >>> however, I'm not having any luck dumping them, so help here would
>> > >>> still be appreciated!
>> > >>>
>> > >>> I set the values of:
>> > >>> export MAHOUT_HOME=/home/sgeadmin/mahout
>> > >>> export HADOOP_HOME=/usr/lib/hadoop
>> > >>> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>> > >>> export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> > >>> on the master; otherwise none of this works.
>> > >>> So first I uploaded the documents using starcluster's put:
>> > >>> starcluster put mycluster text_train /home/sgeadmin/
>> > >>> starcluster put mycluster text_test /home/sgeadmin/
>> > >>>
>> > >>> Then I added them to Hadoop's HDFS filesystem:
>> > >>> dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
>> > >>>
>> > >>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>> > >>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > >>>
>> > >>> Then I called Mahout's seq2sparse to turn them into vectors:
>> > >>> $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>> > >>>
>> > >>> Finally I called cvb. I believe the -dt flag states where the inferred
>> > >>> topics should go, but because I haven't yet been able to dump them I
>> > >>> can't confirm this:
>> > >>> $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>> > >>>
>> > >>> The -k flag is the number of topics, and -nt is the size of the
>> > >>> dictionary; I computed this by counting the number of entries in
>> > >>> dictionary.file-0 inside the vectors folder (in this case under
>> > >>> /user/sgeadmin/text_vec), and -x is the number of iterations.
>> > >>>
>> > >>> If you know how to get the document topic probabilities from here,
>> > >>> help would be most appreciated!
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> --
>> > >>>   -jake
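[Editor's note: the ClassCastException at the top of the thread (Text keys where CachingCVB0Mapper expects IntWritable) lines up with Dan's remark above about running rowid to convert the keys of the sparse document vectors from Text to Integer before cvb. A hedged sketch of that step, reusing the thread's paths; the matrix/docIndex output names are what Mahout 0.7's rowid job is expected to write, so verify against your version:]

```shell
# Convert the Text-keyed tf-vectors to the IntWritable keys cvb expects.
# rowid writes two files under the output dir: 'matrix' (IntWritable ->
# VectorWritable) and 'docIndex' (IntWritable -> Text), the latter mapping
# numeric ids back to the original document keys.
$MAHOUT_HOME/bin/mahout rowid \
  -i /user/root/text_vec/tf-vectors \
  -o /user/root/text_vec/matrix

# Then point cvb at the converted matrix instead of tf-vectors:
$MAHOUT_HOME/bin/mahout cvb \
  -i /user/root/text_vec/matrix/matrix \
  -o /user/root/text_lda -k 100 -nt 29536 -x 20 \
  -dict /user/root/text_vec/dictionary.file-0 \
  -dt /user/root/text_cvb_document \
  -mt /user/root/text_states
```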
