Hi Diego,

A number of us had the same issue when first working with the new CVB algorithm. The vector keys for CVB need to be Integers. You can use the rowid utility to convert the output from seq2sparse to the form needed by CVB, e.g.,

http://comments.gmane.org/gmane.comp.apache.mahout.user/13112

Dan
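The rowid step mentioned above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the helper name rekey_vectors and the MAHOUT variable are made up for illustration and are not part of Mahout, and only the -i (input) and -o (output) options of rowid are used. As far as I know, rowid writes the integer-keyed vectors to a "matrix" subdirectory of the output directory and a "docIndex" mapping back to the original keys next to it, but check this against your Mahout version.

```shell
# Path to the Mahout launcher script; override MAHOUT if it lives elsewhere.
MAHOUT="${MAHOUT:-./bin/mahout}"

# rekey_vectors is a hypothetical helper name, not part of Mahout.
# It runs the rowid utility, which rewrites Text-keyed vectors with
# sequential integer keys.
#   $1: directory of vectors produced by seq2sparse
#   $2: output directory for the re-keyed vectors (name chosen by the caller)
rekey_vectors() {
  "$MAHOUT" rowid -i "$1" -o "$2"
}
```

For the run described in the original message this might be called as "rekey_vectors /tmp/vector/tf-vectors /tmp/vector/rowid", where /tmp/vector/rowid is an assumed output location. The cvb input would then point at the re-keyed vectors rather than at /tmp/vector/tf-vectors.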
________________________________
From: Diego Ceccarelli <[email protected]>
To: [email protected]
Sent: Sunday, October 28, 2012 5:21 PM
Subject: Using LDA in Mahout 0.0.7

Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into some trouble.

I read these tutorials [1,2] and decided to apply LDA to a collection of 1M tweets to see how it works. I indexed them with Lucene as suggested in [2]. Then I discovered that this is not supported in the latest version and that I had to use a sequence file. I saw the 'seqdirectory' utility in [2], but it's a bit impractical to create one million documents, each containing a single tweet. So I wrote a small Java app that takes a file where each line is a document and creates a sequence file <Text,Text> containing the id (line number) and the tweet.

Then I used the seq2sparse utility:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

and created the vectors (it succeeded without problems).

Next, I discovered that LDA is now called cvb (why did you change the name? It's a bit confusing), so I tried to run the command, but I got this error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

(full stack trace here [3])

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors -d /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(why are the parameter names different?)

But I got a similar error:

Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

(full stack trace here [4])

What am I doing wrong? Could you please help me?

Thanks,
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC
