Hi,

I'm running a k-means clustering job on a small sequence file (around 50 KB) on a 2-node cluster. The block size for this file is 20 KB, so it uses 3 mappers. I am using CDH 4.2.0 with YARN and Mahout 0.7.

If the job runs locally on only one node, CPU usage is around 20% and the job finishes in 30 seconds with 11 iterations. If the job runs on both nodes, CPU usage is around 80-90% on one node at a time and the job finishes in 15 minutes with 30 iterations (the maximum, set by parameter).
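To put the difference in perspective, here is a quick back-of-the-envelope calculation I did. The file size comes from adding up the split offsets in the logs below (the last split is 40960+10127, so about 51 KB), and the timings come from the MahoutDriver lines:

```shell
# Split count: a ~51087-byte file with a 20480-byte block size
# gives ceil(51087 / 20480) = 3 map tasks, matching the logs.
file_bytes=51087
block_bytes=20480
echo "splits: $(( (file_bytes + block_bytes - 1) / block_bytes ))"   # splits: 3

# Rough per-iteration cost from the MahoutDriver timings:
# local run: 16259 ms / 11 iterations; YARN run: 890034 ms / 30 iterations.
echo "local: $(( 16259 / 11 )) ms/iteration"    # local: 1478 ms/iteration
echo "yarn:  $(( 890034 / 30 )) ms/iteration"   # yarn:  29667 ms/iteration
```

So each iteration is roughly 20x more expensive on the cluster, which makes me suspect per-job scheduling overhead rather than the actual computation.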
The command is:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5

Last iteration on local:

Cluster Iterator running iteration 11 over priorPath: **/clusters-out/clusters-10
13/04/24 16:42:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
13/04/24 16:42:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/24 16:42:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:42:32 INFO mapred.JobClient: Running job: job_local103137904_0011
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Waiting for map tasks
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000000_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@603b88a8
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:0+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000000_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000000_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000000_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000001_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3183a45e
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:20480+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000001_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000001_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000001_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000002_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3273d766
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:40960+10127
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000002_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000002_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000002_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Map task executor complete.
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4f05c2f
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Merger: Merging 3 sorted segments
13/04/24 16:42:32 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 67116 bytes
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_r_000000_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task attempt_local103137904_0011_r_000000_0 is allowed to commit now
13/04/24 16:42:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_local103137904_0011_r_000000_0' to **/clusters-out/clusters-11
13/04/24 16:42:32 INFO mapred.LocalJobRunner: reduce > reduce
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_r_000000_0' done.
13/04/24 16:42:33 INFO mapred.JobClient: map 100% reduce 100%
13/04/24 16:42:33 INFO mapred.JobClient: Job complete: job_local103137904_0011
13/04/24 16:42:33 INFO mapred.JobClient: Counters: 25
13/04/24 16:42:33 INFO mapred.JobClient:   File System Counters
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of bytes read=69353609
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of bytes written=74174160
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of write operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of bytes read=4607820
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of bytes written=561859
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of read operations=2002
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of write operations=353
13/04/24 16:42:33 INFO mapred.JobClient:   Map-Reduce Framework
13/04/24 16:42:33 INFO mapred.JobClient:     Map input records=311
13/04/24 16:42:33 INFO mapred.JobClient:     Map output records=15
13/04/24 16:42:33 INFO mapred.JobClient:     Map output bytes=67050
13/04/24 16:42:33 INFO mapred.JobClient:     Input split bytes=450
13/04/24 16:42:33 INFO mapred.JobClient:     Combine input records=0
13/04/24 16:42:33 INFO mapred.JobClient:     Combine output records=0
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce input groups=5
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce input records=15
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce output records=5
13/04/24 16:42:33 INFO mapred.JobClient:     Spilled Records=30
13/04/24 16:42:33 INFO mapred.JobClient:     CPU time spent (ms)=0
13/04/24 16:42:33 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/04/24 16:42:33 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/04/24 16:42:33 INFO mapred.JobClient:     Total committed heap usage (bytes)=2079719424
13/04/24 16:42:33 INFO driver.MahoutDriver: Program took 16259 ms (Minutes: 0.27098333333333335)

Last iteration on mapred:

Cluster Iterator running iteration 30 over priorPath: **/clusters-out/clusters-29
13/04/24 16:57:58 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/24 16:57:58 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/24 16:57:59 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:57:59 INFO mapreduce.JobSubmitter: number of splits:3
13/04/24 16:57:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1363860755474_2534
13/04/24 16:57:59 INFO client.YarnClientImpl: Submitted application application_1363860755474_2534 to ResourceManager at **
13/04/24 16:57:59 INFO mapreduce.Job: The url to track the job: **/application_1363860755474_2534/
13/04/24 16:57:59 INFO mapreduce.Job: Running job: job_1363860755474_2534
13/04/24 16:58:08 INFO mapreduce.Job: Job job_1363860755474_2534 running in uber mode : false
13/04/24 16:58:08 INFO mapreduce.Job: map 0% reduce 0%
13/04/24 16:58:22 INFO mapreduce.Job: map 100% reduce 0%
13/04/24 16:58:29 INFO mapreduce.Job: map 100% reduce 100%
13/04/24 16:58:29 INFO mapreduce.Job: Job job_1363860755474_2534 completed successfully
13/04/24 16:58:29 INFO mapreduce.Job: Counters: 43
  File System Counters
    FILE: Number of bytes read=67116
    FILE: Number of bytes written=412109
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=101792
    HDFS: Number of bytes written=11783
    HDFS: Number of read operations=35
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
  Job Counters
    Launched map tasks=3
    Launched reduce tasks=1
    Rack-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=36628
    Total time spent by all reduces in occupied slots (ms)=5028
  Map-Reduce Framework
    Map input records=311
    Map output records=15
    Map output bytes=67050
    Map output materialized bytes=67128
    Input split bytes=450
    Combine input records=0
    Combine output records=0
    Reduce input groups=5
    Reduce shuffle bytes=67128
    Reduce input records=15
    Reduce output records=5
    Spilled Records=30
    Shuffled Maps =3
    Failed Shuffles=0
    Merged Map outputs=4
    GC time elapsed (ms)=220
    CPU time spent (ms)=5360
    Physical memory (bytes) snapshot=962498560
    Virtual memory (bytes) snapshot=2208133120
    Total committed heap usage (bytes)=764805120
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=53434
  File Output Format Counters
    Bytes Written=11783
13/04/24 16:58:29 INFO driver.MahoutDriver: Program took 890034 ms (Minutes: 14.8339)

Is this normal behavior when using Mahout with a small amount of data, or might the cluster have a bad configuration? My hope was that running k-means on MapReduce with 2 nodes would improve performance. Do you have any ideas on what would improve the MapReduce performance?

Thank you,
Mihai
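P.S. One thing I'm considering trying (untested) is forcing the whole file into a single split, so only one map task is launched per iteration and the per-task scheduling overhead is reduced. I'm not certain the Mahout CLI forwards -D options to the job (the local log even warns that GenericOptionsParser isn't used), and the property is mapred.min.split.size in MR1 vs mapreduce.input.fileinputformat.split.minsize in MR2, so treat this as a sketch only:

```shell
# Sketch (unverified): raise the minimum split size well above the ~51 KB file
# so FileInputFormat produces a single split and hence a single mapper.
mahout kmeans \
  -Dmapreduce.input.fileinputformat.split.minsize=1048576 \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5
```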