Hi Jon, FYI, this issue in the fair scheduler was fixed by https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0, though it is present again in MR2: https://issues.apache.org/jira/browse/MAPREDUCE-3268

-Todd

On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <jayaye...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix and some operational jobs. Terasort by itself isn't very realistic as a cluster test but it's nice and simple to run and is good for regression testing things after a change.
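>
> (If anyone wants to reproduce that kind of baseline, the standard examples jar is all it needs. Treat the commands below as a sketch - the jar name and paths are illustrative and depend on your version, and the row count just reflects that teragen rows are 100 bytes each, so 150,000,000,000 rows is roughly 15TB:
>
>   hadoop jar hadoop-examples.jar teragen 150000000000 /benchmarks/tera-in
>   hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out
>   hadoop jar hadoop-examples.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report
>
> Keep the job history files from every run - they're what make the before/after comparison possible.)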
> After the upgrade the intention was to run the same tests and show that the performance hadn't degraded (improved would have been nice but not worse was the minimum). When we ran the terasort we found that performance was about 50% worse - execution time had gone from 40 minutes to 60 minutes. As I've said, terasort doesn't provide a realistic view of operational performance but this showed that something major had changed and we needed to understand it before going further. So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve. It's very easy to say performance isn't good enough, but performance can always be better, so you need to know what's realistic and at what point you're going to stop tuning things. I had a previous baseline that I was trying to match so I knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is. We had the full job history from the before and after jobs, and comparing these we saw that map performance was fairly consistent, as were the reduce sort and reduce phases. The problem was with the shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The important thing here is to make sure you've got as much information as possible. If we'd just kept the overall job time then there would have been a lot more areas to look at, but knowing the problem was with the shuffle allowed me to focus effort in this area.
>
> So what had changed in the shuffle that might have slowed things down? The first thing we thought of was that we'd moved from a tarball deployment to using the RPM, so what effect might this have had? Our operational configuration compresses the map output, and in the past we've had problems with Java compression libraries being used rather than native ones, which affected performance. We knew the RPM deployment had moved the native libraries, so we spent some time confirming to ourselves that these were being used correctly (but this turned out not to be the problem). We then spent time doing some process and server profiling - using dstat to look at the server bottlenecks and jstack/jmap to check what the task tracker and reduce processes were doing. Although not directly relevant to this particular problem, doing this was useful just to get my head around what Hadoop is doing at various points of the process.
>
> The next bit was one place where I got lucky - I happened to be logged onto one of the worker nodes when a test job was running and I noticed that there weren't any reduce tasks running on the server. This was odd, as we'd submitted more reducers than we have servers, so I'd expected at least one task to be running on each server. Checking the job tracker log file, it turned out that since the upgrade the job tracker had been submitting reduce tasks to only 10% of the available nodes. A different 10% each time the job was run, so clearly the individual task trackers were working OK, but there was something odd going on with the task allocation. The job tracker log also showed that before the upgrade tasks had been fairly evenly distributed, so something had changed.
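>
> (If you want to run the same check on your own cluster, the quickest way is to count task assignments per tracker straight out of the job tracker log. The grep pattern below is an assumption to adjust rather than gospel - check the exact wording of the assignment lines in your own log first, but they should look something like "Adding task (REDUCE) ... for tracker 'tracker_<host>...'":
>
>   grep "Adding task (REDUCE)" hadoop-*-jobtracker-*.log | grep -o "tracker_[^:']*" | sort | uniq -c | sort -rn
>
> A roughly even count per tracker is what you want to see; after our upgrade only about 10% of the trackers appeared at all.)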
> After that it was a case of digging around the source code to find out which classes were available for task allocation and what inside them had changed. This can be quite daunting, but if you're comfortable with Java then it's just a case of following the calls through the code. Once I found the cause it was just a case of working out what my options were for working around it (in this case turning off the multiple assignment option - I can work out whether I want to turn it back on in slower time).
>
> Where I think we got very lucky is that we hit this problem at all. The configuration we use for the terasort has just over 1 reducer per worker node rather than maxing out the available reducer slots. That decision was made several years ago and I can't remember the reasons for it. If we'd been using a larger number of reducers then the number of worker nodes in use would have been similar regardless of the allocation algorithm, and so the performance would have looked similar before and after the upgrade. We would have hit this problem eventually, but probably not until we started running user jobs, and by then it would have been too late to do the intrusive investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how much time did you spend? Could you share some of your experience with performance diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of workloads. Read about the multiple schedulers available to see if they have features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few differences that set them apart and make them each useful for different use cases. If I had to summarize some of those differences: the FairScheduler is better suited to SLA-style job execution because of its preemption features (which make it useful where user and service workloads are mixed), while the CapacityScheduler provides manual-resource-request oriented scheduling for odd jobs with high memory workloads, etc. (which makes it useful for running certain specific kinds of jobs alongside the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem? We are using the default Hadoop scheduler. Is there a reason to use the Fair Scheduler if most of the time we don't have more than 4 jobs running simultaneously?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler property. It is true by default, which works well for most workloads, if not benchmark-style workloads. I would not usually trust that as a base performance measure of everything that comes out of an upgrade.
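>>> >>
>>> >> (If you want to pin the old behaviour while you evaluate it, it's just a mapred-site.xml property alongside the scheduler class. The snippet below is an illustration only - double-check the property names against the fair scheduler docs for your exact version, and note that changes here generally need a JobTracker restart to take effect:
>>> >>
>>> >>   <property>
>>> >>     <name>mapred.jobtracker.taskScheduler</name>
>>> >>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>>> >>   </property>
>>> >>   <property>
>>> >>     <name>mapred.fairscheduler.assignmultiple</name>
>>> >>     <value>false</value>
>>> >>   </property>
>>> >>
>>> >> Turning it off is what returned Jon's terasort to its pre-upgrade performance.)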
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4 and I haven't noticed any performance issues. By "multiple assignment feature" do you mean speculative execution (mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution)?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been evenly distributed around the cluster - one per task tracker in turn, looping around the cluster until all tasks were allocated. After the upgrade all reduce tasks had been submitted to a small number of task trackers - submit tasks until a task tracker's slots were full and then move on to the next task tracker. Skewing the reducers like this quite clearly hit the benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite (MAPREDUCE-2981), which appears to have subtly modified the behaviour of the assign multiple property. Previously this property caused a single map and a single reduce task to be allocated in a task tracker heartbeat (rather than the default of a map or a reduce). After the upgrade it allocates as many tasks as there are available task slots. Turning off the multiple assignment feature returned the terasort to its pre-upgrade performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think through the consequences for real-world applications (though in practice we're likely to move away from the fair scheduler due to MAPREDUCE-4451). Investigating this has been a pain, so to warn other users, is there anywhere central that can be used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have hit performance problems. Before the upgrade a 15TB terasort took about 45 minutes; afterwards it takes just over an hour. Looking in more detail it appears the shuffle phase has increased from 20 minutes to 40 minutes. Does anyone have any thoughts about what'
>>>
>>> --
>>> Harsh J

--
Todd Lipcon
Software Engineer, Cloudera