Dear Sean,

I finally solved it yesterday. I removed the hadoop-core dependency from the main pom, but the problem was that the examples module also depends on classes from hadoop-core/hadoop-common, and hadoop-common was not declared in examples/pom.xml. I was able to compile after adding that dependency to examples/pom.xml (and also hadoop-mapreduce-client-core).
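For the record, the entries I added to examples/pom.xml look more or less like this (exact scope/exclusions aside; the version simply reuses the parent pom's ${hadoop.version} property):

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
  </dependency>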
Anyway, that alone did not solve my problem; the real issue was simpler :) When I called cvb:

  bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid -o /user/diegolo/twitter/text_lda -k 100 -dict /user/diegolo/twitter/dictionary.file-0 --maxIter 20

I was passing as input the output directory of rowid, while cvb expects the matrix inside the rowid output (/user/diegolo/twitter/tweets-rowid/matrix). Running

  bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid/matrix -o /user/diegolo/twitter/text_lda -k 100 -dict /user/diegolo/twitter/dictionary.file-0 --maxIter 20

made hadoop happy :)

Now I have my output and I'm trying to understand it. I have some problems with vectordump. It seems that

  ./bin/mahout vectordump -i lda/part-m-00000 -o prob --dictionary vector/dictionary.file-0 -dt sequencefile

creates a file where, for each topic, I get the probability of every term belonging to that topic. I would like to see only the most probable terms per topic:

  ./bin/mahout vectordump -i ~/twitter-lda/lda/part-m-00000 -o ~/twitter-lda/prob -d ~/twitter-lda/vector/dictionary.file-0 -dt sequencefile -sort true -vs 20

but I always get this error (also with really small vector sizes):

  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  12/11/01 00:57:08 INFO common.AbstractJob: Command line arguments: {--dictionary=[/Users/diego/twitter-lda/vector/dictionary.file-0], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[/Users/diego/twitter-lda/lda/part-m-00000], --output=[/Users/diego/twitter-lda/prob], --sortVectors=[true], --startPhase=[0], --tempDir=[temp], --vectorSize=[20]}
  2012-11-01 00:57:08.827 java[10552:1203] Unable to load realm info from SCDynamicStore
  12/11/01 00:57:09 INFO vectors.VectorDumper: Sort? true
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
          at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
          at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
          at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
          at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)

Do you know this issue? Also, I don't understand how to see the topics for a given tweet. (There is a small plain-Java sketch of the kind of per-topic listing I mean at the bottom of this mail, below the quoted thread.)

Thanks,
Diego

On Tue, Oct 30, 2012 at 12:44 PM, Sean Owen <[email protected]> wrote:

> If you want to use Hadoop 0.23, there is no point in specifying 0.22 (a
> mostly abandoned branch), or 0.20 (an old version of the stable branch, but
> something I thought you didn't want to use for some reason). So I would
> simply stop bothering with any of that. Don't use SNAPSHOTs of anything.
>
> examples / integration depend on core, but if core works, they should work.
> You have to 'mvn install' your core artifact locally to make it use it.
> Your error may be caused by that.
>
> Why do you want to use 0.23 in the first place? 1.1.x and 2.0.x are the
> best stable / experimental branches now.
>
> On Tue, Oct 30, 2012 at 11:27 AM, Diego Ceccarelli <[email protected]> wrote:
>
>> Thanks Sean,
>>
>> So I first tried commenting out the hadoop-core dependency, but it did not work;
>> then I added a different version for hadoop-core (0.22.0-SNAPSHOT)
>> and I was able to compile the mahout core (mvn -P hadoop-0.23 install -DskipTests).
>> I had errors with the integration and examples modules (and it
>> seems that I need to compile them too in order to run mahout).
>> (integration errors: [1], examples errors: [2])
>>
>> So I set the hadoop-core version to 0.20.2, and I was able to compile
>> everything except the integration module (which I excluded from the reactor).
>> When I ran mahout I still got the same initial error.
>> So I used hadoop-core 0.22.0-SNAPSHOT and compiled
>> mahout examples separately with the 0.20.2 version.
>>
>> Then I tried to run lda on my twitter dataset:
>>
>> bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid -o /user/diegolo/twitter/text_lda -k 100 -dict /user/diegolo/twitter/dictionary.file-0 --maxIter 20
>>
>> The job started but I got this error:
>>
>> 12/10/30 11:19:44 INFO mapreduce.Job: Running job: job_1351559192903_4948
>> 12/10/30 11:19:55 INFO mapreduce.Job: Job job_1351559192903_4948 running in uber mode : false
>> 12/10/30 11:19:55 INFO mapreduce.Job: map 0% reduce 0%
>> 12/10/30 11:20:07 INFO mapreduce.Job: Task Id : attempt_1351559192903_4948_m_000001_0, Status : FAILED
>> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
>>
>> Do you think this is due to the dirty mix I did? Why does bin/mahout need the examples folder?
>>
>> Thanks,
>> Diego
>>
>> [1] http://pastebin.com/q6VsSAFB
>> [2] http://pastebin.com/YvcegjBZ
>>
>> On Mon, Oct 29, 2012 at 11:20 PM, Sean Owen <[email protected]> wrote:
>> > I haven't tried it, I don't know if it works. From reading the pom.xml it
>> > looks like it should not consider hadoop-core a dependency if you select
>> > the other profile. If not, I don't know why. You could always just delete
>> > all the hadoop-core bits and do away with the alternate profile; that would
>> > work.
>> >
>> > On Mon, Oct 29, 2012 at 10:07 PM, Diego Ceccarelli <[email protected]> wrote:
>> >
>> >> > But, most of all note that you are not looking for hadoop-core but
>> >> > hadoop-common
>> >>
>> >> Sorry, but it's 11 pm here and I'm a bit tired ;) I don't understand the
>> >> above sentence: in the main pom.xml, hadoop-core and hadoop-common are
>> >> imported with the same placeholder ${hadoop.version}, and the problem I
>> >> have is that I can't compile because maven does not find version 0.23.3/4
>> >> of hadoop-core. Are you telling me that I have to exclude hadoop-core, or
>> >> use an older version for the core?
>> >> Sorry again :(
>> >>
>> >> cheers
>> >> Diego
>>
>> --
>> Computers are useless. They can only give you answers.
>> (Pablo Picasso)
>> _______________
>> Diego Ceccarelli
>> High Performance Computing Laboratory
>> Information Science and Technologies Institute (ISTI)
>> Italian National Research Council (CNR)
>> Via Moruzzi, 1
>> 56124 - Pisa - Italy
>>
>> Phone: +39 050 315 3055
>> Fax: +39 050 315 2040
>> ________________________________________
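P.S. Just to be concrete about what I mean by "the most probable terms per topic": something like the plain-Java sketch below. This is illustrative only (the dictionary and probability arrays are made-up toy data, not Mahout API); it keeps a bounded min-heap of size k so memory stays small even for a very large dictionary.

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  public class TopTermsPerTopic {

    // Return the indices of the k largest probabilities, using a bounded
    // min-heap so memory stays O(k) regardless of dictionary size.
    static List<Integer> topK(double[] termProbabilities, int k) {
      PriorityQueue<Integer> heap = new PriorityQueue<>(k,
          Comparator.comparingDouble(i -> termProbabilities[i]));
      for (int term = 0; term < termProbabilities.length; term++) {
        heap.offer(term);
        if (heap.size() > k) {
          heap.poll(); // drop the currently least probable term
        }
      }
      List<Integer> result = new ArrayList<>(heap);
      result.sort((a, b) -> Double.compare(termProbabilities[b], termProbabilities[a]));
      return result;
    }

    public static void main(String[] args) {
      // toy data: one topic over a five-word dictionary
      String[] dictionary = {"rain", "sun", "wind", "snow", "fog"};
      double[] topic = {0.10, 0.45, 0.05, 0.30, 0.10};
      for (int term : topK(topic, 3)) {
        System.out.printf("%s\t%.2f%n", dictionary[term], topic[term]);
      }
    }
  }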
--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________
