Hi,

I'm running a k-means clustering job on a small sequence file (around 50 KB) on a 2-node cluster. The block size for this file is 20 KB, so it uses 3 mappers. I am using CDH 4.2.0 with YARN and Mahout 0.7.

If the job runs locally on only one node, CPU usage is around 20% and the job finishes in 30 seconds with 11 iterations. If the job runs on both nodes, CPU usage is around 80-90% on one node at a time and the job finishes in 15 minutes with 30 iterations (the maximum, set by parameter).
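To put the difference in perspective, here is a quick back-of-the-envelope calculation I did. The file size comes from adding up the split offsets in the logs below (the last split is 40960+10127, so about 51 KB), and the timings come from the MahoutDriver lines:

```shell
# Split count: a ~51087-byte file with a 20480-byte block size
# gives ceil(51087 / 20480) = 3 map tasks, matching the logs.
file_bytes=51087
block_bytes=20480
echo "splits: $(( (file_bytes + block_bytes - 1) / block_bytes ))"   # splits: 3

# Rough per-iteration cost from the MahoutDriver timings:
# local run: 16259 ms / 11 iterations; YARN run: 890034 ms / 30 iterations.
echo "local: $(( 16259 / 11 )) ms/iteration"    # local: 1478 ms/iteration
echo "yarn:  $(( 890034 / 30 )) ms/iteration"   # yarn:  29667 ms/iteration
```

So each iteration is roughly 20x more expensive on the cluster, which makes me suspect per-job scheduling overhead rather than the actual computation.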
The command is:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5

Last iteration on local:

Cluster Iterator running iteration 11 over priorPath: **/clusters-out/clusters-10
13/04/24 16:42:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
13/04/24 16:42:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/24 16:42:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:42:32 INFO mapred.JobClient: Running job: job_local103137904_0011
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Waiting for map tasks
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000000_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@603b88a8
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:0+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000000_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000000_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000000_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000001_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3183a45e
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:20480+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000001_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000001_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000001_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_000002_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3273d766
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:40960+10127
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_000002_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_000002_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_000002_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Map task executor complete.
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4f05c2f
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Merger: Merging 3 sorted segments
13/04/24 16:42:32 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 67116 bytes
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_r_000000_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task attempt_local103137904_0011_r_000000_0 is allowed to commit now
13/04/24 16:42:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_local103137904_0011_r_000000_0' to **/clusters-out/clusters-11
13/04/24 16:42:32 INFO mapred.LocalJobRunner: reduce > reduce
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_r_000000_0' done.
13/04/24 16:42:33 INFO mapred.JobClient: map 100% reduce 100%
13/04/24 16:42:33 INFO mapred.JobClient: Job complete: job_local103137904_0011
13/04/24 16:42:33 INFO mapred.JobClient: Counters: 25
13/04/24 16:42:33 INFO mapred.JobClient:   File System Counters
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of bytes read=69353609
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of bytes written=74174160
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     FILE: Number of write operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of bytes read=4607820
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of bytes written=561859
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of read operations=2002
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/04/24 16:42:33 INFO mapred.JobClient:     HDFS: Number of write operations=353
13/04/24 16:42:33 INFO mapred.JobClient:   Map-Reduce Framework
13/04/24 16:42:33 INFO mapred.JobClient:     Map input records=311
13/04/24 16:42:33 INFO mapred.JobClient:     Map output records=15
13/04/24 16:42:33 INFO mapred.JobClient:     Map output bytes=67050
13/04/24 16:42:33 INFO mapred.JobClient:     Input split bytes=450
13/04/24 16:42:33 INFO mapred.JobClient:     Combine input records=0
13/04/24 16:42:33 INFO mapred.JobClient:     Combine output records=0
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce input groups=5
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce input records=15
13/04/24 16:42:33 INFO mapred.JobClient:     Reduce output records=5
13/04/24 16:42:33 INFO mapred.JobClient:     Spilled Records=30
13/04/24 16:42:33 INFO mapred.JobClient:     CPU time spent (ms)=0
13/04/24 16:42:33 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/04/24 16:42:33 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/04/24 16:42:33 INFO mapred.JobClient:     Total committed heap usage (bytes)=2079719424
13/04/24 16:42:33 INFO driver.MahoutDriver: Program took 16259 ms (Minutes: 0.27098333333333335)

Last iteration on mapred:

Cluster Iterator running iteration 30 over priorPath: **/clusters-out/clusters-29
13/04/24 16:57:58 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/04/24 16:57:58 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/04/24 16:57:59 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:57:59 INFO mapreduce.JobSubmitter: number of splits:3
13/04/24 16:57:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1363860755474_2534
13/04/24 16:57:59 INFO client.YarnClientImpl: Submitted application application_1363860755474_2534 to ResourceManager at **
13/04/24 16:57:59 INFO mapreduce.Job: The url to track the job: **/application_1363860755474_2534/
13/04/24 16:57:59 INFO mapreduce.Job: Running job: job_1363860755474_2534
13/04/24 16:58:08 INFO mapreduce.Job: Job job_1363860755474_2534 running in uber mode : false
13/04/24 16:58:08 INFO mapreduce.Job: map 0% reduce 0%
13/04/24 16:58:22 INFO mapreduce.Job: map 100% reduce 0%
13/04/24 16:58:29 INFO mapreduce.Job: map 100% reduce 100%
13/04/24 16:58:29 INFO mapreduce.Job: Job job_1363860755474_2534 completed successfully
13/04/24 16:58:29 INFO mapreduce.Job: Counters: 43
  File System Counters
    FILE: Number of bytes read=67116
    FILE: Number of bytes written=412109
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=101792
    HDFS: Number of bytes written=11783
    HDFS: Number of read operations=35
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
  Job Counters
    Launched map tasks=3
    Launched reduce tasks=1
    Rack-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=36628
    Total time spent by all reduces in occupied slots (ms)=5028
  Map-Reduce Framework
    Map input records=311
    Map output records=15
    Map output bytes=67050
    Map output materialized bytes=67128
    Input split bytes=450
    Combine input records=0
    Combine output records=0
    Reduce input groups=5
    Reduce shuffle bytes=67128
    Reduce input records=15
    Reduce output records=5
    Spilled Records=30
    Shuffled Maps =3
    Failed Shuffles=0
    Merged Map outputs=4
    GC time elapsed (ms)=220
    CPU time spent (ms)=5360
    Physical memory (bytes) snapshot=962498560
    Virtual memory (bytes) snapshot=2208133120
    Total committed heap usage (bytes)=764805120
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=53434
  File Output Format Counters
    Bytes Written=11783
13/04/24 16:58:29 INFO driver.MahoutDriver: Program took 890034 ms (Minutes: 14.8339)

Is this normal behavior when using Mahout with a small amount of data, or might the cluster have a bad configuration? My hope was that running k-means on MapReduce with 2 nodes would improve performance. Do you have any ideas on what would improve the MapReduce performance?

Thank you,
Mihai
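P.S. One thing I'm considering trying (untested) is forcing the whole file into a single split, so only one map task is launched per iteration and the per-task scheduling overhead is reduced. I'm not certain the Mahout CLI forwards -D options to the job (the local log even warns that GenericOptionsParser isn't used), and the property is mapred.min.split.size in MR1 vs mapreduce.input.fileinputformat.split.minsize in MR2, so treat this as a sketch only:

```shell
# Sketch (unverified): raise the minimum split size well above the ~51 KB file
# so FileInputFormat produces a single split and hence a single mapper.
mahout kmeans \
  -Dmapreduce.input.fileinputformat.split.minsize=1048576 \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5
```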