Hi Markus,

First I'd check that your input term vectors are evenly partitioned into more than two part files. You can force a certain map-side parallelism by creating a specific number of part files here. No matter how you configure map slots, you'll need the input organized so that every map slot actually receives a task to process.
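As a quick sanity check, you can count how many part files the input currently has (path taken from your mahout cvb invocation below; this assumes HDFS is up and the vectors live at that location):

```shell
# Count the part-* files under the input directory. If this prints 2,
# that matches the "Launched map tasks=2" counter in your log.
hadoop fs -ls ~/mahout-files/mydatavectors_int | grep -c 'part-'
```

With only two part files and an input smaller than the configured split size, the job will never launch more than two map tasks, no matter how many slots are available.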
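One more thing to double-check (an assumption on my part, since I can't see your daemon configs): in MRv1 the 'mapred.tasktracker.map.tasks.maximum' and 'mapred.tasktracker.reduce.tasks.maximum' settings are read by the TaskTracker daemon at startup, not from the job client, so passing them with -D on the mahout command line has no effect. They would need to go into mapred-site.xml on the node, followed by a TaskTracker restart, e.g.:

```xml
<!-- mapred-site.xml (daemon-side; restart the TaskTracker after editing).
     The values mirror the ones you passed with -D below. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
```

That would also explain why you saw 10 reduce tasks despite setting the maximum to 7 on the command line.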
Andy
@sagemintblue

On Dec 3, 2012, at 12:44 AM, Markus Paaso <[email protected]> wrote:

> The log shows that there are 2 map tasks and 10 reduce tasks.
> How can there be 10 reduce tasks when I set parameter
> '-Dmapred.tasktracker.reduce.tasks.maximum=7'?
> I would like to increase the amount of concurrent map tasks. Any parameter
> suggestions for that?
>
> It seems that configuration parameter
> 'mapred.tasktracker.map.tasks.maximum' doesn't grow the number of
> concurrently running map tasks...
>
> Some log rows from mahout cvb:
>
> 12/12/03 10:30:23 INFO mapred.JobClient: Job complete: job_201212011004_0432
> 12/12/03 10:30:23 INFO mapred.JobClient: Counters: 32
> 12/12/03 10:30:23 INFO mapred.JobClient: File System Counters
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes read=8076460
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes written=18396152
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of large read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of write operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes read=14054985
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes written=4040120
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of read operations=166
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of large read operations=0
> 12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of write operations=91
> 12/12/03 10:30:23 INFO mapred.JobClient: Job Counters
> 12/12/03 10:30:23 INFO mapred.JobClient: Launched map tasks=2
> 12/12/03 10:30:23 INFO mapred.JobClient: Launched reduce tasks=10
> 12/12/03 10:30:23 INFO mapred.JobClient: Data-local map tasks=2
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=456617
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=108715
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
> 12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
> 12/12/03 10:30:23 INFO mapred.JobClient: Map-Reduce Framework
> 12/12/03 10:30:23 INFO mapred.JobClient: Map input records=77332
> 12/12/03 10:30:23 INFO mapred.JobClient: Map output records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Map output bytes=8075900
> 12/12/03 10:30:23 INFO mapred.JobClient: Input split bytes=288
> 12/12/03 10:30:23 INFO mapred.JobClient: Combine input records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Combine output records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce input groups=50
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce shuffle bytes=8076520
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce input records=100
> 12/12/03 10:30:23 INFO mapred.JobClient: Reduce output records=50
> 12/12/03 10:30:23 INFO mapred.JobClient: Spilled Records=200
> 12/12/03 10:30:23 INFO mapred.JobClient: CPU time spent (ms)=570850
> 12/12/03 10:30:23 INFO mapred.JobClient: Physical memory (bytes) snapshot=3334303744
> 12/12/03 10:30:23 INFO mapred.JobClient: Virtual memory (bytes) snapshot=35329503232
> 12/12/03 10:30:23 INFO mapred.JobClient: Total committed heap usage (bytes)=6070009856
>
> Cheers, Markus
>
> 2012/12/3 Markus Paaso <[email protected]>
>
>> Hi,
>>
>> I have some problems to utilize all available CPU power for 'mahout cvb'
>> command.
>> The CPU usage is just about 35% and IO wait ~0%.
>> I have 8 cores and 28 GB memory in a single computer that is running
>> Mahout 0.7-cdh-4.1.2 with Hadoop 2.0.0-cdh4.1.2 in pseudo-distributed mode.
>> How can I take advantage of all the CPU power for a single 'mahout cvb'
>> task?
>>
>> I use following parameters to run mahout cvb:
>>
>> mahout cvb
>> -Ddfs.namenode.handler.count=32
>> -Dmapred.job.tracker.handler.count=32
>> -Dio.sort.factor=30
>> -Dio.sort.mb=500
>> -Dio.file.buffer.size=65536
>> -Dmapred.child.java.opts=-Xmx2g
>> -Dmapred.map.child.java.opts=-Xmx2g
>> -Dmapred.reduce.child.java.opts=-Xmx2g
>> -Dmapred.job.reuse.jvm.num.tasks=-1
>> -Dmapred.map.tasks=7
>> -Dmapred.reduce.tasks=7
>> -Dmapred.max.split.size=3145728
>> -Dmapred.min.split.size=3145728
>> -Dmapred.tasktracker.map.tasks.maximum=7
>> -Dmapred.tasktracker.reduce.tasks.maximum=7
>> -Dmapred.tasktracker.tasks.maximum=7
>> --input ~/mahout-files/mydatavectors_int
>> --output ~/mahout-files/topics
>> --num_terms 10078
>> --num_topics 50
>> --doc_topic_output ~/mahout-files/doc-topics
>> --maxIter 50
>> --num_update_threads 8
>> --num_train_threads 8
>> -block 1
>> --test_set_fraction 0.1
>> --convergenceDelta 0.0000001
>> --tempDir ~/mahout-files/cvb-temp
>>
>> Linux top command says:
>>
>> Cpu(s): 33.9%us, 1.1%sy, 0.0%ni, 65.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 28479224k total, 16398624k used, 12080600k free, 899576k buffers
>> Swap: 28942332k total, 0k used, 28942332k free, 5733368k cached
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 19765 mapred 20 0 2811m 650m 16m S 129 2.3 3:59.06 java
>> 19721 mapred 20 0 2812m 650m 16m S 125 2.3 3:53.70 java
>>
>> So just 2.5 / 8 cores are fully in use.
>>
>> Regards, Markus
>
> --
> Markus Paaso
> Developer, Sagire Software Oy
> http://sagire.fi/
