Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into some trouble. I read the tutorials [1,2] and decided to apply LDA to a collection of 1M tweets to see how it works. I first indexed them with Lucene as suggested in [2], but then discovered that reading from a Lucene index is no longer supported in the latest version, so I had to use a sequence file instead. I saw the 'seqdirectory' utility in [2], but creating one million documents, one file per tweet, is impractical. So I wrote a small Java app that takes a file where each line is a document and writes a <Text, Text> sequence file, with the line number as the key and the tweet as the value (sketched below).
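Roughly, the converter looks like this (the input file tweets.txt and the output path are placeholders; I'm using the classic Hadoop SequenceFile.createWriter API):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("tweet-sequence-file");

    // <Text, Text> sequence file: key = line number, value = tweet
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
    BufferedReader reader = new BufferedReader(new FileReader("tweets.txt"));
    try {
      String line;
      long lineNumber = 0;
      while ((line = reader.readLine()) != null) {
        writer.append(new Text(Long.toString(lineNumber++)), new Text(line));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}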
Then I used the seq2sparse utility to create the vectors:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

It completed without problems.

Now, I discovered that lda is now called cvb (why did you change the name? It's a bit confusing...). I tried to run the command, but I got this error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

(full stack trace here [3])

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(why are the parameter names different?)

But I got a similar error:

Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

(full stack trace here [4])

Where am I going wrong? Could you please help me?

Thanks,
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC
