Thanks Dan, that solved it.

On Sun, Oct 28, 2012 at 10:40 PM, DAN HELM <[email protected]> wrote:
> Hi Diego,
>
> A number of us had the same issue when first working with the new CVB
> algorithm. The vector keys for CVB need to be Integers. You can use the
> rowid utility to convert the output from seq2sparse to the form needed by
> CVB, e.g.,
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
>
> Dan
>
> From: Diego Ceccarelli <[email protected]>
> To: [email protected]
> Sent: Sunday, October 28, 2012 5:21 PM
> Subject: Using LDA in Mahout 0.0.7
>
> Dear all,
>
> I'm trying to use the LDA framework in Mahout and I'm having
> some trouble.
> I saw these tutorials [1,2], and I decided to apply LDA to a collection of
> 1M tweets to see how it works. I indexed them with Lucene as suggested
> in [2]. Then I discovered that in the latest version this is not supported
> and I had to use a sequence file.
> I saw the 'seqdirectory' util in [2], but it's a bit impractical to create
> one million documents, each one with a single tweet. So I wrote a small
> Java app that takes a file where each line is a document and creates a
> sequence file <Text,Text> containing the id (line number) and the tweet.
> Then I used the seq2sparse util:
>
> ./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> and I created the vectors (it succeeded without problems).
>
> Now, I discovered that LDA is now called cvb (why did you change the name?
> It's a bit confusing..), so I tried to run the command, but I got this error:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> (full stack trace here [3])
>
> I also tried the local version:
>
> ./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic
>
> (why are the parameter names different???)
> But I got a similar error:
>
> Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
> (full stack trace here [4])
>
> Where am I going wrong? Could you please help me?
> Thanks
> Diego
>
> [1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> [2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> [3] http://pastebin.com/nV3T74fe
> [4] http://pastebin.com/JH1xQHuC
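
For anyone hitting the same ClassCastException, the fix Dan describes can be sketched roughly as below. This is only a sketch under stated assumptions, not the exact commands used in this thread: the input paths come from the original message, while /tmp/vector/rowid and the /tmp/lda/* output paths are made up for illustration. The cvb flag names (-dict, -dt, -mt, -k) are as they appear in the Mahout example scripts of that period, so please check `./bin/mahout cvb --help` on your version. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: give the seq2sparse vectors integer keys with rowid, then run cvb.
# Assumption: output paths below are hypothetical; adjust to your setup.
MAHOUT="${MAHOUT:-./bin/mahout}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

lda_pipeline() {
  # Step 1: rowid rewrites the <Text, VectorWritable> output of seq2sparse
  # into <IntWritable, VectorWritable> (written to .../matrix) and keeps
  # the int -> original key mapping in .../docIndex.
  run "$MAHOUT" rowid -i /tmp/vector/tf-vectors -o /tmp/vector/rowid

  # Step 2: run cvb on the integer-keyed matrix.
  # Flag names are an assumption based on the Mahout example scripts.
  run "$MAHOUT" cvb -i /tmp/vector/rowid/matrix \
      -dict /tmp/vector/dictionary.file-0 \
      -o /tmp/lda/topics -dt /tmp/lda/doc-topics -mt /tmp/lda/model-tmp \
      -k 100
}

lda_pipeline
```

The important part is that cvb reads /tmp/vector/rowid/matrix and not the tf-vectors directory directly; the docIndex file can be used afterwards to map the integer ids back to the original document ids.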
-- 
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
