Hi there,

I've been doing some performance testing with Hadoop and have been experiencing 
highly variable results which I am trying to understand. I've been examining 
how long it takes to perform a particular MR job, and am finding that the time 
taken varies by a factor of 2 when I repeat the job. Note that the data, 
algorithm, cluster etc. are completely the same (and I am the only person on 
the cluster).

The way I do the test is from a simple shell script that just runs the job 
again and again. I find that the job is as fast as 5 mins, but as slow as 10 
mins, with everything in between.
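For reference, the driver is nothing fancier than a sketch like the following (the hadoop jar, class name, and paths in the example call are placeholders, not the real job):

```shell
#!/bin/sh
# Hypothetical repetition harness: run a command N times and report
# the wall-clock seconds each run took.
run_repeated() {
    cmd=$1
    n=$2
    i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        sh -c "$cmd"
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Example call (placeholder jar/class/paths):
# run_repeated "hadoop jar myjob.jar MyJob /input /output" 10
```

It is the per-run times printed by a loop like this that spread between roughly 5 and 10 minutes.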

I've examined the output of two log files, where I can see that the performance 
difference is coming from the map and shuffle phases. For a sample 'fast' job, 
the map phases take on average 2 mins 34 secs, whereas for a sample 'slow' job 
they take on average 4 mins 12 secs. Interestingly, if I then look at the 
counters for random maps (one each from the fast and slow jobs) then I find 
that all counters are pretty much equal – including CPU time. This suggests 
that the slowdown comes from bottlenecks at disk I/O or network. Since I am the 
only user on the network (it's a dedicated gigabit switch) and the only one using 
the disks, I don't understand what can be happening. Also, the total data is 
not that huge – the job analyses 21GB with replication 2 spread across 8 disks 
on 4 nodes. The total disk output from the reducers is about 300MB. I'm not 
sure how to investigate further – is there some other diagnostic within Hadoop 
that can tell me where the code is waiting (e.g. for network or disk I/O), or 
perhaps some system tool that can indicate performance hits in specific places?

Thanks for any suggestions

Peter
