Created a new JIRA issue https://issues.apache.org/jira/browse/MAHOUT-1055
-Markus

2012/8/13 Markus Paaso <[email protected]>

> Why is IntWritable used as the id field in Mahout CVB?
> (org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper)
> Does Long have such a significant impact on performance?
>
> Long is much more usable as an id type, and int causes compatibility issues
> like the one below.
>
>
> Cheers, Markus
>
>
> 2012/8/10 Markus Paaso <[email protected]>
>
>> Hi
>>
>> I am using mahout 0.7 with hadoop 0.20.205 and getting a
>> ClassCastException when running the mahout cvb command on lucene vectors.
>> It seems that a LongWritable is being cast to an IntWritable.
>>
>> Is there something I am missing?
>>
>>
>> Regards, Markus
>>
>>
>> /opt/mahout/bin/mahout lucene.vector --dir /home/markus/workspace/lucene-index --output ../mahout-files/vectors/content --field content --dictOut ../mahout-files/dictionaries/content --norm 2 --idField personId --maxDFPercent 20 --minDF 2 -w TFIDF
>>
>> /opt/mahout/bin/mahout cvb -D mapred.child.java.opts=-Xmx2048M --input ../mahout-files/vectors/content --output ../mahout-files/lda-workdir --num_terms 30957 --overwrite --num_topics 20 --doc_topic_output ../mahout-files/lda-training --maxIter 10 --tempDir ../mahout-files/lda-temp
>>
>>
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop
>> MAHOUT-JOB: /opt/mahout/mahout-examples-0.7-job.jar
>> Picked up JAVA_TOOL_OPTIONS: -XX:+UseBiasedLocking
>> Picked up JAVA_TOOL_OPTIONS: -XX:+UseBiasedLocking
>> 12/08/10 08:18:02 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --doc_topic_output=[../mahout-files/lda-training], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[../mahout-files/vectors/content], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[10], --num_terms=[30957], --num_topics=[20], --num_train_threads=[4], --num_update_threads=[1], --output=[../mahout-files/lda-workdir], --overwrite=null, --startPhase=[0], --tempDir=[../mahout-files/lda-temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0]}
>> 12/08/10 08:18:02 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on ../mahout-files/vectors/content (numTerms: 30957), finding 20-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 10, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored ../mahout-files/lda-workdir. Random initialization seed is 2871, holding out 0.0 of the data for perplexity check
>>
>> 12/08/10 08:18:02 INFO cvb.CVB0Driver: p(topic|docId) will be stored ../mahout-files/lda-training
>>
>> 12/08/10 08:18:02 INFO cvb.CVB0Driver: Current iteration number: 0
>> 12/08/10 08:18:02 INFO cvb.CVB0Driver: About to run iteration 1 of 10
>> 12/08/10 08:18:02 INFO cvb.CVB0Driver: About to run: Iteration 1 of 10, input path: ../mahout-files/lda-temp/topicModelState/model-0
>> 12/08/10 08:18:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>> 12/08/10 08:18:02 INFO input.FileInputFormat: Total input paths to process : 1
>> 12/08/10 08:18:03 INFO mapred.JobClient: Running job: job_local_0001
>> 12/08/10 08:18:03 INFO util.ProcessTree: setsid exited with exit code 0
>> 12/08/10 08:18:03 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1a1399
>> 12/08/10 08:18:03 INFO mapred.MapTask: io.sort.mb = 10
>> 12/08/10 08:18:03 INFO mapred.MapTask: data buffer = 7969177/9961472
>> 12/08/10 08:18:03 INFO mapred.MapTask: record buffer = 26214/32768
>> 12/08/10 08:18:03 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 12/08/10 08:18:03 INFO compress.CodecPool: Got brand-new decompressor
>> 12/08/10 08:18:03 INFO cvb.CachingCVB0Mapper: Retrieving configuration
>> 12/08/10 08:18:03 INFO cvb.CachingCVB0Mapper: Initializing read model
>> 12/08/10 08:18:03 INFO cvb.CachingCVB0Mapper: No model files found
>> 12/08/10 08:18:03 INFO cvb.CachingCVB0Mapper: Initializing write model
>> 12/08/10 08:18:03 INFO cvb.CachingCVB0Mapper: Initializing model trainer
>> 12/08/10 08:18:03 INFO cvb.ModelTrainer: Starting training threadpool with 4 threads
>> 12/08/10 08:18:03 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
>>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 12/08/10 08:18:04 INFO mapred.JobClient: map 0% reduce 0%
>> 12/08/10 08:18:04 INFO mapred.JobClient: Job complete: job_local_0001
>> 12/08/10 08:18:04 INFO mapred.JobClient: Counters: 0
>> Exception in thread "main" java.lang.InterruptedException: Failed to complete iteration 1 stage 1
>>     at org.apache.mahout.clustering.lda.cvb.CVB0Driver.runIteration(CVB0Driver.java:518)
>>     at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:304)
>>     at org.apache.mahout.clustering.lda.cvb.CVB0Driver.run(CVB0Driver.java:187)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.mahout.clustering.lda.cvb.CVB0Driver.main(CVB0Driver.java:550)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>

--
Markus Paaso
Developer, Sagire Software Oy
http://sagire.fi/
