Hi Markus,

First I'd check that your input term vectors are evenly partitioned into more than two part files. You can force a certain map-side parallelism by creating a specific number of part files here. No matter how you configure map slots, you'll need the input organized so that every map slot actually receives a task to process.
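As a quick sanity check, you can count how many part files the input currently has (path taken from your mahout cvb invocation below; this assumes HDFS is up and the vectors live at that location):

```shell
# Count the part-* files under the input directory. If this prints 2,
# that matches the "Launched map tasks=2" counter in your log.
hadoop fs -ls ~/mahout-files/mydatavectors_int | grep -c 'part-'
```

With only two part files and an input smaller than the configured split size, the job will never launch more than two map tasks, no matter how many slots are available.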
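One more thing to double-check (an assumption on my part, since I can't see your daemon configs): in MRv1 the 'mapred.tasktracker.map.tasks.maximum' and 'mapred.tasktracker.reduce.tasks.maximum' settings are read by the TaskTracker daemon at startup, not from the job client, so passing them with -D on the mahout command line has no effect. They would need to go into mapred-site.xml on the node, followed by a TaskTracker restart, e.g.:

```xml
<!-- mapred-site.xml (daemon-side; restart the TaskTracker after editing).
     The values mirror the ones you passed with -D below. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
```

That would also explain why you saw 10 reduce tasks despite setting the maximum to 7 on the command line.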
Andy
@sagemintblue

On Dec 3, 2012, at 12:44 AM, Markus Paaso <[email protected]> wrote:

> The log shows that there are 2 map tasks and 10 reduce tasks.
> How can there be 10 reduce tasks when I set parameter
> '-Dmapred.tasktracker.reduce.tasks.maximum=7'?
> I would like to increase the amount of concurrent map tasks. Any parameter
> suggestions for that?
>
> It seems that configuration parameter
> 'mapred.tasktracker.map.tasks.maximum' doesn't grow the number of
> concurrently running map tasks...
>
> Some log rows from mahout cvb:
>
> 12/12/03 10:30:23 INFO mapred.JobClient: Job complete: job_201212011004_0432
> 12/12/03 10:30:23 INFO mapred.JobClient: Counters: 32
> 12/12/03 10:30:23 INFO mapred.JobClient: File System Counters
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes read=8076460
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes written=18396152
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of large read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of write operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes read=14054985
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes written=4040120
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of read operations=166
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of large read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of write operations=91
> 12/12/03 10:30:23 INFO mapred.JobClient: Job Counters
> 12/12/03 10:30:23 INFO mapred.JobClient: Launched map tasks=2
> 12/12/03 10:30:23 INFO mapred.JobClient: Launched reduce tasks=10
> 12/12/03 10:30:23 INFO mapred.JobClient: Data-local map tasks=2
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=456617
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=108715
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
> 12/12/03 10:30:23 INFO mapred.JobClient: Map-Reduce Framework
> 12/12/03 10:30:23 INFO mapred.JobClient: Map input records=77332
> 12/12/03 10:30:23 INFO mapred.JobClient: Map output records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Map output bytes=8075900
> 12/12/03 10:30:23 INFO mapred.JobClient: Input split bytes=288
> 12/12/03 10:30:23 INFO mapred.JobClient: Combine input records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Combine output records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce input groups=50
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce shuffle bytes=8076520
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce input records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce output records=50
> 12/12/03 10:30:23 INFO mapred.JobClient: Spilled Records=200
> 12/12/03 10:30:23 INFO mapred.JobClient: CPU time spent (ms)=570850
> 12/12/03 10:30:23 INFO mapred.JobClient: Physical memory (bytes) snapshot=3334303744
> 12/12/03 10:30:23 INFO mapred.JobClient: Virtual memory (bytes) snapshot=35329503232
> 12/12/03 10:30:23 INFO mapred.JobClient: Total committed heap usage (bytes)=6070009856
>
> Cheers, Markus
>
> 2012/12/3 Markus Paaso <[email protected]>
>
>> Hi,
>>
>> I have some problems to utilize all available CPU power for 'mahout cvb'
>> command.
>> The CPU usage is just about 35% and IO wait ~0%.
>> I have 8 cores and 28 GB memory in a single computer that is running
>> Mahout 0.7-cdh-4.1.2 with Hadoop 2.0.0-cdh4.1.2 in pseudo-distributed mode.
>> How can I take advantage of all the CPU power for a single 'mahout cvb'
>> task?
>>
>> I use following parameters to run mahout cvb:
>>
>> mahout cvb
>> -Ddfs.namenode.handler.count=32
>> -Dmapred.job.tracker.handler.count=32
>> -Dio.sort.factor=30
>> -Dio.sort.mb=500
>> -Dio.file.buffer.size=65536
>> -Dmapred.child.java.opts=-Xmx2g
>> -Dmapred.map.child.java.opts=-Xmx2g
>> -Dmapred.reduce.child.java.opts=-Xmx2g
>> -Dmapred.job.reuse.jvm.num.tasks=-1
>> -Dmapred.map.tasks=7
>> -Dmapred.reduce.tasks=7
>> -Dmapred.max.split.size=3145728
>> -Dmapred.min.split.size=3145728
>> -Dmapred.tasktracker.map.tasks.maximum=7
>> -Dmapred.tasktracker.reduce.tasks.maximum=7
>> -Dmapred.tasktracker.tasks.maximum=7
>> --input ~/mahout-files/mydatavectors_int
>> --output ~/mahout-files/topics
>> --num_terms 10078
>> --num_topics 50
>> --doc_topic_output ~/mahout-files/doc-topics
>> --maxIter 50
>> --num_update_threads 8
>> --num_train_threads 8
>> -block 1
>> --test_set_fraction 0.1
>> --convergenceDelta 0.0000001
>> --tempDir ~/mahout-files/cvb-temp
>>
>> Linux top command says:
>>
>> Cpu(s): 33.9%us, 1.1%sy, 0.0%ni, 65.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 28479224k total, 16398624k used, 12080600k free, 899576k buffers
>> Swap: 28942332k total, 0k used, 28942332k free, 5733368k cached
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 19765 mapred 20 0 2811m 650m 16m S 129 2.3 3:59.06 java
>> 19721 mapred 20 0 2812m 650m 16m S 125 2.3 3:53.70 java
>>
>> So just 2.5 / 8 cores are fully in use.
>>
>> Regards, Markus
>
> --
> Markus Paaso
> Developer, Sagire Software Oy
> http://sagire.fi/
