Jie,

Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky).
Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix and some operational jobs. Terasort by itself isn't a very realistic cluster test, but it's simple to run and good for regression testing after a change. After the upgrade the intention was to run the same tests and show that performance hadn't degraded (improved would have been nice, but not worse was the minimum). When we ran the terasort we found that performance was about 50% worse - execution time had gone from 40 minutes to 60 minutes. As I've said, terasort doesn't give a realistic view of operational performance, but this showed that something major had changed and we needed to understand it before going further.

So how to go about diagnosing this ... First rule - understand what you're trying to achieve. It's very easy to say performance isn't good enough, but performance can always be better, so you need to know what's realistic and at what point you're going to stop tuning. I had a previous baseline that I was trying to match, so I knew what I was aiming for.

The next thing to do is profile your job and identify where the problem is. We had the full job history from the before and after runs, and comparing them showed that map performance was fairly consistent, as were the reduce sort and reduce phases. The problem was with the shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The important thing here is to make sure you've got as much information as possible. If we'd only kept the overall job time there would have been a lot more areas to look at, but knowing the problem was in the shuffle allowed me to focus effort there.

So what had changed in the shuffle that might have slowed things down? The first thing we thought of was that we'd moved from a tarball deployment to the RPM, and what effect that might have had. Our operational configuration compresses the map output, and in the past we've had problems with the Java compression libraries being used rather than the native ones, which hurt performance. We knew the RPM deployment had moved the native library, so we spent some time confirming it was being picked up correctly (this turned out not to be the problem).

We then spent time on process and server profiling - dstat to look at server bottlenecks, jstack/jmap to check what the task tracker and reduce processes were doing. Although not directly relevant to this particular problem, doing this was useful just to get my head around what Hadoop is doing at various points in the process.

The next bit was one place where I got lucky - I happened to be logged onto one of the worker nodes while a test job was running and noticed that there weren't any reduce tasks running on that server. This was odd, as we'd submitted more reducers than we have servers, so I'd expected at least one task on each server. Checking the job tracker log file, it turned out that since the upgrade the job tracker had been submitting reduce tasks to only 10% of the available nodes - a different 10% each time the job was run, so clearly the individual task trackers were working OK but something odd was going on with task allocation. The same log showed that before the upgrade tasks had been fairly evenly distributed, so something had changed.
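For what it's worth, the log check itself was nothing clever. Something along these lines, run against the job tracker log, gives you the picture straight away - the regex is only my guess at the assignment line format (it may well differ between versions), so check it against a real line from your own log before trusting the numbers:

    # Count reduce task attempts assigned to each task tracker by scanning
    # the job tracker log. The pattern below is an assumption about the log
    # format - adjust it to match a real assignment line from your cluster.
    import re
    import sys
    from collections import Counter

    # assumed format: ... Adding task (REDUCE) 'attempt_..._r_...' ... tracker 'tracker_host:...'
    ASSIGN = re.compile(r"Adding task \(REDUCE\) '\S+'.*tracker '([^']+)'")

    def reduce_tasks_per_tracker(log_path):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                match = ASSIGN.search(line)
                if match:
                    counts[match.group(1)] += 1
        return counts

    if __name__ == "__main__":
        counts = reduce_tasks_per_tracker(sys.argv[1])
        for tracker, n in counts.most_common():
            print("%5d  %s" % (n, tracker))
        print("%d trackers ran reduce tasks" % len(counts))

In a healthy run you'd expect roughly the same count against every tracker; in our case only a handful of trackers appeared at all.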
After that it was a case of digging around the source code to find out which classes were available for task allocation and what had changed inside them. This can be quite daunting, but if you're comfortable with Java it's just a case of following the calls through the code. Once I'd found the cause it was just a case of working out my options for working around it (in this case turning off the multiple assignment option - I can work out whether I want to turn it back on in slower time).

Where I think we got very lucky is that we hit this problem at all. The configuration we use for the terasort has just over one reducer per worker node rather than maxing out the available reducer slots. That decision was made several years ago and I can't remember the reasons for it. If we'd been using a larger number of reducers then the number of worker nodes in use would have been similar regardless of the allocation algorithm, and so the performance would have looked similar before and after the upgrade. We would have hit this problem eventually, but probably not until we started running user jobs, and by then it would have been too late to do the intrusive investigations that were possible now.
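To make that last point concrete, here's a toy model of the two assignment behaviours - my own sketch, not Hadoop code, and the tracker and slot counts are invented for illustration:

    # Toy comparison of the two reduce assignment behaviours we saw.
    # Numbers are invented: 100 task trackers, 10 reduce slots each,
    # 120 reducers (just over one per node, as in our terasort config).

    def one_per_heartbeat(trackers, slots, reducers):
        # Pre-upgrade behaviour: at most one reduce task handed out per
        # heartbeat, so tasks end up spread round-robin across the trackers.
        assigned = [0] * trackers
        for i in range(reducers):
            assigned[i % trackers] += 1
        return assigned

    def fill_each_tracker(trackers, slots, reducers):
        # Post-upgrade behaviour as we observed it: each heartbeat fills
        # every free slot on the reporting tracker before the next tracker
        # gets a look-in.
        assigned = [0] * trackers
        remaining = reducers
        for t in range(trackers):
            assigned[t] = min(slots, remaining)
            remaining -= assigned[t]
            if remaining == 0:
                break
        return assigned

    if __name__ == "__main__":
        for name, policy in [("one per heartbeat", one_per_heartbeat),
                             ("fill each tracker", fill_each_tracker)]:
            allocation = policy(100, 10, 120)
            busy = sum(1 for a in allocation if a > 0)
            print("%s: %d of 100 trackers running reduces" % (name, busy))
        # one per heartbeat: 100 of 100 trackers running reduces
        # fill each tracker: 12 of 100 trackers running reduces

With just over one reducer per node the new behaviour leaves roughly 90% of the cluster idle during the reduce phase, which is exactly the skew we saw in the job tracker log; max out the reducer slots instead and both policies light up most of the cluster, so the change would have been invisible to us.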
Hope this has been useful.

Regards,
Jon

On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:

> Jon:
>
> This is interesting and helpful! How did you figure out the cause? And how
> much time did you spend? Could you share some experience of performance
> diagnosis?
>
> Jie
>
> On Tuesday, November 27, 2012, Harsh J wrote:
>
>> Hi Amit,
>>
>> The default scheduler is FIFO, and may not work for all forms of
>> workloads. Read the multiple schedulers available to see if they have
>> features that may benefit your workload:
>>
>> Capacity Scheduler:
>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> FairScheduler:
>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>
>> While there's a good overlap of features between them, there are a few
>> differences that set them apart and make them each useful for different
>> use-cases. If I had to summarize some such differences, FairScheduler is
>> better suited to SLA forms of job execution due to its preemptive
>> features (which make it useful in user and service mix scenarios), while
>> CapacityScheduler provides manual-resource-request oriented scheduling
>> for odd jobs with high memory workloads, etc. (which makes it useful for
>> running certain specific kinds of jobs alongside the regular ones).
>>
>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>
>>> So this is a FairScheduler problem?
>>> We are using the default Hadoop scheduler. Is there a reason to use the
>>> Fair Scheduler if most of the time we don't have more than 4 jobs
>>> running simultaneously?
>>>
>>> On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>>> Hi Amit,
>>>>
>>>> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>>> property. It is true by default, which works well for most workloads
>>>> if not benchmark-style workloads. I would not usually trust that as a
>>>> base perf. measure of everything that comes out of an upgrade.
>>>>
>>>> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>>>
>>>> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>>>>
>>>>> Hi Jon,
>>>>>
>>>>> I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>>>> 1.0.4 and I haven't noticed any performance issues. By "multiple
>>>>> assignment feature" do you mean speculative execution
>>>>> (mapred.map.tasks.speculative.execution and
>>>>> mapred.reduce.tasks.speculative.execution)?
>>>>>
>>>>> On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>>>>
>>>>>> Problem solved, but worth warning others about.
>>>>>>
>>>>>> Before the upgrade the reducers for the terasort process had been
>>>>>> evenly distributed around the cluster - one per task tracker in turn,
>>>>>> looping around the cluster until all tasks were allocated. After the
>>>>>> upgrade all reduce tasks had been submitted to a small number of task
>>>>>> trackers - submit tasks until the task tracker slots were full and
>>>>>> then move onto the next task tracker. Skewing the reducers like this
>>>>>> quite clearly hit the benchmark performance.
>>>>>>
>>>>>> The reason for this turns out to be the fair scheduler rewrite
>>>>>> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>>>>> of the assign multiple property. Previously this property caused a
>>>>>> single map and a single reduce task to be allocated in a task tracker
>>>>>> heartbeat (rather than the default of a map or a reduce). After the
>>>>>> upgrade it allocates as many tasks as there are available task slots.
>>>>>> Turning off the multiple assignment feature returned the terasort to
>>>>>> its pre-upgrade performance.
>>>>>>
>>>>>> I can see potential benefits to this change and need to think through
>>>>>> the consequences for real-world applications (though in practice
>>>>>> we're likely to move away from the fair scheduler due to
>>>>>> MAPREDUCE-4451). Investigating this has been a pain, so to warn other
>>>>>> users is there anywhere central that can be used to record upgrade
>>>>>> gotchas like this?
>>>>>>
>>>>>> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>>>>>> have hit performance problems. Before the upgrade a 15TB terasort
>>>>>>> took about 45 minutes, afterwards it takes just over an hour.
>>>>>>> Looking in more detail it appears the shuffle phase has increased
>>>>>>> from 20 minutes to 40 minutes. Does anyone have any thoughts about
>>>>>>> what ...
>>
>> --
>> Harsh J