Hello Dan,
Thank you for this reference. I was unable to get Mahout 0.7 to run LDA,
so I downgraded to 0.5 and it worked.
Maybe I should try this.
Vineeth
On 12-10-29 02:02 PM, Diego Ceccarelli wrote:
Thanks Dan, that solved it.
On Sun, Oct 28, 2012 at 10:40 PM, DAN HELM <[email protected]> wrote:
Hi Diego,
A number of us had the same issue when first working with the new CVB
algorithm. The vector keys for CVB need to be Integers. You can use the
rowid utility to convert the output from seq2sparse to the form needed by
CVB, e.g.,
http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
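Roughly something like this (from memory -- check ./bin/mahout rowid --help
for the exact options, and adjust the paths to yours):

./bin/mahout rowid -i /tmp/vector/tf-vectors -o /tmp/vector/matrix

The output directory gets a "matrix" part keyed by IntWritable (plus a
"docIndex" file mapping the rows back to your original Text keys), and you
then point cvb at that matrix instead of the tf-vectors directory.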
Dan
From: Diego Ceccarelli <[email protected]>
To: [email protected]
Sent: Sunday, October 28, 2012 5:21 PM
Subject: Using LDA in Mahout 0.7
Dear all,
I'm trying to use the LDA framework in Mahout and I'm running into
some trouble.
I read these tutorials [1,2] and decided to apply LDA to a collection of
1M tweets to see how it works. I indexed them with Lucene as suggested
in [2], then discovered that in the latest version this is not supported
and I had to use a sequence file instead.
I saw the 'seqdirectory' utility in [2], but it's impractical to create
one million files, each containing a single tweet. So I wrote a small Java
app that takes a file where each line is a document and creates a
SequenceFile<Text,Text> with the id (line number) as key and the tweet
as value.
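Roughly, the app looks like this (just a sketch -- class and path names
are made up, and it assumes the old Hadoop SequenceFile.createWriter API):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = plain text file with one tweet per line
    // args[1] = output sequence file path
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    String line;
    long lineNumber = 0;
    while ((line = in.readLine()) != null) {
      // key = line number, value = the tweet itself
      writer.append(new Text(Long.toString(lineNumber++)), new Text(line));
    }
    writer.close();
    in.close();
  }
}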
Then I used the seq2sparse utility:
./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o
/tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
and created the vectors (it succeeded without problems).
Now, I discovered that lda is now called cvb (why did you change the name?
It's a bit confusing...), so I tried to run the command, but I got this error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.IntWritable
(full stack trace here [3])
I also tried the local version:
./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d
/tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out
--topicOutputFile /tmp/topic
(why are the parameter names different?)
But I got a similar error:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer
cannot be cast to java.lang.String
(full stack trace here [4])
Where am I going wrong? Could you please help me?
Thanks
Diego
[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC