Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into
some trouble.
I went through these tutorials [1,2] and decided to apply LDA to a
collection of 1M tweets to see how it works. I indexed them with Lucene
as suggested in [2], but then discovered that the latest version no
longer supports reading from a Lucene index, so I had to use a
sequence file instead.
I saw the 'seqdirectory' utility in [2], but creating one million
files, one per tweet, is impractical. So I wrote a small Java app that
takes a file where each line is a document and writes a
SequenceFile<Text, Text> containing the id (the line number) and the
tweet; a rough sketch of it is below.
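
This is a minimal sketch of what the app does (class name is made up
and error handling is omitted; it uses the plain Hadoop
SequenceFile.Writer constructor):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads one tweet per line and writes a SequenceFile<Text, Text>
// keyed by the line number.
public class TweetsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String tweet;
    long lineNo = 0;
    while ((tweet = in.readLine()) != null) {
      // Key = line number as Text, value = the tweet itself.
      writer.append(new Text(Long.toString(lineNo++)), new Text(tweet));
    }
    in.close();
    writer.close();
  }
}
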
Then I used the seq2sparse util:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file \
  -o /tmp/vector -wt tf \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

and it created the vectors successfully.

Now, I discovered that lda is now called cvb (why did you change the
name? it's a bit confusing...), so I ran the cvb command, but I got
this error:
 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
org.apache.hadoop.io.IntWritable
(full stack trace here [3])
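
In case it helps diagnose the cast, the key/value classes that
seq2sparse actually wrote can be checked with a tiny reader like this
(a sketch; adjust the part file name to whatever seq2sparse produced):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Prints the key/value classes stored in the tf-vectors output.
public class InspectVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("/tmp/vector/tf-vectors/part-r-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    System.out.println("key class:   " + reader.getKeyClass());
    System.out.println("value class: " + reader.getValueClass());
    reader.close();
  }
}

Given the exception, I'd expect it to report Text keys.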

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors \
  -d /tmp/vector/dictionary.file-0 --numTopics 100 \
  --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(Why are the parameter names different here?)
But I got a similar error:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer 
cannot be cast to java.lang.String
(full stack trace here [4])

Where am I going wrong? Could you please help me?
Thanks,
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC
