OK, this output looks more like what you should be getting, in the sense that there are some common words with similar global frequencies toward the beginning of the lists. But why you only got 2 topics is very odd.
Can you list the contents of the cvb-output directories, and the commands you gave to vectordump them? Running vectordump over the doc-topic output will be strange unless you leave off the dictionary part, so that you can just list the topicIds with weights for each document (but even then, you'll want to be able to know what each of those topics is really about, which is what we're working on getting for you now).

On Fri, Apr 19, 2013 at 6:58 AM, Chris Harrington <[email protected]> wrote:

> That output was from the cvb-topic-doc.
>
> Just ran it over the cvb-output and got
>
> {how:0.04017873894220033,you:0.02145053718356672,your:0.015258172557645477,open:0.013308164155725721,use:0.012769472622541729,search:0.011104235407141033,web:0.011054257715475113,up:0.008681792828858947,do:0.007301515628267762,install:0.00724558241579199}
>
> {ruby:0.013622365167924151,search:0.009151768347011088,ucd:0.009024998353413487,street:0.008153757865114717,information:0.008010912906951214,symfony2:0.007610946012031929,college:0.007488257104426453,get:0.006831742925644331,form:0.006798612982669133,us:0.006420604369821055}
>
> There were only the 2 rows as above, so going by your last mail it only found 2 topics?
>
> If so, how come it only output 2 topics when I gave it -k 20 as a parameter?
>
> On 19 Apr 2013, at 13:55, David LaBarbera wrote:
>
>> Did you run vectordump with the lda output directory (cvb-output in your case) or the document topic output (cvb-topic-doc)? Depending on which you're looking at, you'll have:
>>
>> lda output: each row corresponds to a topic and the elements are (term index:probability). The terms correspond to what's in the dictionary (contentDataDir/sparseVectors/dictionary.file-0). You can add the dictionary to the command line, so the output will be (term:probability).
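For reference, the two vectordump invocations being discussed here can be sketched roughly as below. This is only a sketch: the `bin/mahout` path and the output locations are taken from the commands quoted later in this thread, and the dictionary flags are the ones David spells out.

```shell
# Topic-term model (cvb-output): each row is a topic. Attaching the
# dictionary makes vectordump print (term:probability) instead of
# (termIndex:probability).
bin/mahout vectordump -i cvb-output \
  --dictionary ./contentDataDir/sparseVectors/dictionary.file-0 \
  --dictionaryType sequencefile

# Doc-topic output (cvb-topic-doc): each row is a document and the
# elements are (topicId:weight). Leave the dictionary flags off here,
# since the keys are topic ids, not dictionary term indexes.
bin/mahout vectordump -i cvb-topic-doc
```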
>> The flag should be
>>
>> --dictionary ./contentDataDir/sparseVectors/dictionary.file-0 --dictionaryType sequencefile
>>
>> dt output: each row is a document and the elements are (topic:probability).
>>
>> David
>>
>> On Apr 19, 2013, at 8:30 AM, Chris Harrington <[email protected]> wrote:
>>
>>> Just ran vectordump over the output from cvb but I have no idea what I'm looking at:
>>>
>>> {1.0:0.0689751034234147,0hu:0.052798138507741114,06:0.046108327846619585,091:0.04079964524901706,1:0.03488226667358313,10g:0.03471651100042406,07:0.03051583712303273,10.30am:0.029957963431693112,1171:0.028424194208528646,10.4.10:0.028173810240271588}
>>>
>>> Can someone give me an explanation of the above?
>>>
>>> In the Mahout in Action book there was a table which displayed topics with their top terms. How would I go from the above to something like that? i.e.
>>> topic 0 -> term1, term2, term3 ... termN
>>> topic 1 -> term1, term2, term3 ... termN
>>> etc.
>>>
>>> On 19 Apr 2013, at 10:19, Chris Harrington wrote:
>>>
>>>> Found the issue: it was the folder I gave for outputting the matrix in the rowid command. For cvb I gave it ./contentDataDir/matrix as the matrix location; instead I should have supplied ./contentDataDir/matrix/matrix.
>>>>
>>>> On 17 Apr 2013, at 12:46, Chris Harrington wrote:
>>>>
>>>>> So I've got 0.8 now but I'm running into an error:
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout seqdirectory -i ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -wt tf
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout rowid -i ./contentDataDir/sparseVectors/tf-vectors/ -o ./contentDataDir/matrix
>>>>>
>>>>> ../../workspace2/trunk/bin/mahout cvb -i ./contentDataDir/matrix -o cvb-output -k 100 -x 1 -dict ./contentDataDir/sparseVectors/dictionary.file-0 -dt
>>>>> cvb-topic-doc -mt cvb-topic-model
>>>>>
>>>>> but the cvb command hits a class cast exception:
>>>>>
>>>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>>>>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>>>
>>>>> I thought seq2sparse took care of turning Hadoop Text into Mahout's VectorWritable. Where have I gone wrong?
>>>>>
>>>>> On 16 Apr 2013, at 14:45, Jake Mannix wrote:
>>>>>
>>>>>> You should just be building off of trunk (0.8-snapshot), in which case you should be working just fine.
>>>>>>
>>>>>> On Tue, Apr 16, 2013 at 6:43 AM, Chris Harrington <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been trying to get the vector dumper to work on the output from cvb but it's throwing lots of errors.
>>>>>>>
>>>>>>> I found several old mails on the mailing list regarding this issue, specifically this:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3CCAHSfFsy2oWRuzwVzGW57LRYaJ+LuudNu-W5EO0wnV_ff=uy...@mail.gmail.com%3E
>>>>>>>
>>>>>>> That thread is a bit old, so I was wondering: was there a patch or anything to fix it, or do I need to use the 0.8-snapshot?
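Chris's "topic N -> top terms" question can be answered with a little post-processing of the dumped topic rows. Below is a minimal sketch, assuming each vectordump row is a single `{term:weight,term:weight,...}` line like the ones pasted in this thread; the abbreviated sample rows are illustrative, not the real output.

```python
def top_terms(vectordump_row, n=5):
    """Parse one '{term:weight,...}' row printed by vectordump and
    return the n highest-weighted terms, best first."""
    pairs = []
    for item in vectordump_row.strip().strip("{}").split(","):
        # rpartition on the last ':' so terms never split incorrectly
        term, _, weight = item.rpartition(":")
        pairs.append((term, float(weight)))
    pairs.sort(key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in pairs[:n]]

# Abbreviated versions of the two topic rows pasted earlier in the thread:
rows = [
    "{how:0.0402,you:0.0215,your:0.0153,open:0.0133,use:0.0128}",
    "{ruby:0.0136,search:0.0092,ucd:0.0090,street:0.0082,information:0.0080}",
]
for topic_id, row in enumerate(rows):
    print("topic %d -> %s" % (topic_id, ", ".join(top_terms(row, 3))))
# prints:
# topic 0 -> how, you, your
# topic 1 -> ruby, search, ucd
```

In practice you would read one row per line from the file vectordump wrote (via its -o flag) instead of the inline `rows` list.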
>>>>>>
>>>>>> --
>>>>>>   -jake

--
  -jake
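For anyone landing on this thread later, the fix Chris found can be collected in one place. The sketch below replays the pipeline from the Apr 17 mail with the corrected cvb input path (and -k 20 as in Chris's later run); the comment about docIndex is my reading of the stack trace, not something verified here.

```shell
bin/mahout seqdirectory -i ./contentDataDir/output-content-segment -o ./contentDataDir/sequenced
bin/mahout seq2sparse -i ./contentDataDir/sequenced -o ./contentDataDir/sparseVectors --namedVector -wt tf

# rowid writes two outputs under -o: 'matrix' (the VectorWritable rows that
# cvb expects) and 'docIndex' (an id -> Text mapping). Handing cvb the whole
# directory makes it read docIndex too, which would explain the
# Text -> VectorWritable ClassCastException.
bin/mahout rowid -i ./contentDataDir/sparseVectors/tf-vectors/ -o ./contentDataDir/matrix

# So point cvb at the matrix file itself, not its parent directory:
bin/mahout cvb -i ./contentDataDir/matrix/matrix -o cvb-output -k 20 -x 1 \
  -dict ./contentDataDir/sparseVectors/dictionary.file-0 \
  -dt cvb-topic-doc -mt cvb-topic-model
```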
