Jie,

Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky).
Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix and some operational jobs. Terasort by itself isn't a very realistic cluster test, but it's simple to run and good for regression testing after a change. After the upgrade the intention was to run the same tests and show that performance hadn't degraded (improved would have been nice, but not worse was the minimum). When we ran the terasort we found that performance was about 50% worse - execution time had gone from 40 minutes to 60 minutes. As I've said, terasort doesn't give a realistic view of operational performance, but this showed that something major had changed and we needed to understand it before going further.

So how to go about diagnosing this ... First rule - understand what you're trying to achieve. It's very easy to say performance isn't good enough, but performance can always be better, so you need to know what's realistic and at what point you're going to stop tuning. I had a previous baseline that I was trying to match, so I knew what I was aiming for.

The next thing to do is profile your job and identify where the problem is. We had the full job history from the before and after runs, and comparing them showed that map performance was fairly consistent, as were the reduce sort and reduce phases. The problem was with the shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The important thing here is to make sure you've got as much information as possible. If we'd only kept the overall job time there would have been a lot more areas to look at, but knowing the problem was in the shuffle allowed me to focus effort there.

So what had changed in the shuffle that might have slowed things down? The first thing we thought of was that we'd moved from a tarball deployment to the RPM, and what effect that might have had. Our operational configuration compresses the map output, and in the past we've had problems with the Java compression libraries being used rather than the native ones, which hurt performance. We knew the RPM deployment had moved the native library, so we spent some time confirming it was being picked up correctly (this turned out not to be the problem).

We then spent time on process and server profiling - dstat to look at server bottlenecks, jstack/jmap to check what the task tracker and reduce processes were doing. Although not directly relevant to this particular problem, doing this was useful just to get my head around what Hadoop is doing at various points in the process.

The next bit was one place where I got lucky - I happened to be logged onto one of the worker nodes while a test job was running and noticed that there weren't any reduce tasks running on that server. This was odd, as we'd submitted more reducers than we have servers, so I'd expected at least one task on each server. Checking the job tracker log file, it turned out that since the upgrade the job tracker had been submitting reduce tasks to only 10% of the available nodes - a different 10% each time the job was run, so clearly the individual task trackers were working OK but something odd was going on with task allocation. The same log showed that before the upgrade tasks had been fairly evenly distributed, so something had changed.
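For what it's worth, the log check itself was nothing clever. Something along these lines, run against the job tracker log, gives you the picture straight away - the regex is only my guess at the assignment line format (it may well differ between versions), so check it against a real line from your own log before trusting the numbers:

    # Count reduce task attempts assigned to each task tracker by scanning
    # the job tracker log. The pattern below is an assumption about the log
    # format - adjust it to match a real assignment line from your cluster.
    import re
    import sys
    from collections import Counter

    # assumed format: ... Adding task (REDUCE) 'attempt_..._r_...' ... tracker 'tracker_host:...'
    ASSIGN = re.compile(r"Adding task \(REDUCE\) '\S+'.*tracker '([^']+)'")

    def reduce_tasks_per_tracker(log_path):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                match = ASSIGN.search(line)
                if match:
                    counts[match.group(1)] += 1
        return counts

    if __name__ == "__main__":
        counts = reduce_tasks_per_tracker(sys.argv[1])
        for tracker, n in counts.most_common():
            print("%5d  %s" % (n, tracker))
        print("%d trackers ran reduce tasks" % len(counts))

In a healthy run you'd expect roughly the same count against every tracker; in our case only a handful of trackers appeared at all.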
After that it was a case of digging around the source code to find out which classes were available for task allocation and what had changed inside them. This can be quite daunting, but if you're comfortable with Java it's just a case of following the calls through the code. Once I'd found the cause it was just a case of working out my options for working around it (in this case turning off the multiple assignment option - I can work out whether I want to turn it back on in slower time).

Where I think we got very lucky is that we hit this problem at all. The configuration we use for the terasort has just over one reducer per worker node rather than maxing out the available reducer slots. That decision was made several years ago and I can't remember the reasons for it. If we'd been using a larger number of reducers then the number of worker nodes in use would have been similar regardless of the allocation algorithm, and so the performance would have looked similar before and after the upgrade. We would have hit this problem eventually, but probably not until we started running user jobs, and by then it would have been too late to do the intrusive investigations that were possible now.
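To make that last point concrete, here's a toy model of the two assignment behaviours - my own sketch, not Hadoop code, and the tracker and slot counts are invented for illustration:

    # Toy comparison of the two reduce assignment behaviours we saw.
    # Numbers are invented: 100 task trackers, 10 reduce slots each,
    # 120 reducers (just over one per node, as in our terasort config).

    def one_per_heartbeat(trackers, slots, reducers):
        # Pre-upgrade behaviour: at most one reduce task handed out per
        # heartbeat, so tasks end up spread round-robin across the trackers.
        assigned = [0] * trackers
        for i in range(reducers):
            assigned[i % trackers] += 1
        return assigned

    def fill_each_tracker(trackers, slots, reducers):
        # Post-upgrade behaviour as we observed it: each heartbeat fills
        # every free slot on the reporting tracker before the next tracker
        # gets a look-in.
        assigned = [0] * trackers
        remaining = reducers
        for t in range(trackers):
            assigned[t] = min(slots, remaining)
            remaining -= assigned[t]
            if remaining == 0:
                break
        return assigned

    if __name__ == "__main__":
        for name, policy in [("one per heartbeat", one_per_heartbeat),
                             ("fill each tracker", fill_each_tracker)]:
            allocation = policy(100, 10, 120)
            busy = sum(1 for a in allocation if a > 0)
            print("%s: %d of 100 trackers running reduces" % (name, busy))
        # one per heartbeat: 100 of 100 trackers running reduces
        # fill each tracker: 12 of 100 trackers running reduces

With just over one reducer per node the new behaviour leaves roughly 90% of the cluster idle during the reduce phase, which is exactly the skew we saw in the job tracker log; max out the reducer slots instead and both policies light up most of the cluster, so the change would have been invisible to us.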
Hope this has been useful.

Regards,
Jon

On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:

> Jon:
>
> This is interesting and helpful! How did you figure out the cause? And how
> much time did you spend? Could you share some experience of performance
> diagnosis?
>
> Jie
>
> On Tuesday, November 27, 2012, Harsh J wrote:
>
>> Hi Amit,
>>
>> The default scheduler is FIFO, and may not work for all forms of
>> workloads. Read the multiple schedulers available to see if they have
>> features that may benefit your workload:
>>
>> Capacity Scheduler:
>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> FairScheduler:
>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>
>> While there's a good overlap of features between them, there are a few
>> differences that set them apart and make them each useful for different
>> use-cases. If I had to summarize some such differences, FairScheduler is
>> better suited to SLA forms of job execution due to its preemptive
>> features (which make it useful in user and service mix scenarios), while
>> CapacityScheduler provides manual-resource-request oriented scheduling
>> for odd jobs with high memory workloads, etc. (which makes it useful for
>> running certain specific kinds of jobs alongside the regular ones).
>>
>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>
>>> So this is a FairScheduler problem?
>>> We are using the default Hadoop scheduler. Is there a reason to use the
>>> Fair Scheduler if most of the time we don't have more than 4 jobs
>>> running simultaneously?
>>>
>>> On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>>> Hi Amit,
>>>>
>>>> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>>> property. It is true by default, which works well for most workloads
>>>> if not benchmark-style workloads. I would not usually trust that as a
>>>> base perf. measure of everything that comes out of an upgrade.
>>>>
>>>> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>>>
>>>> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>>>>
>>>>> Hi Jon,
>>>>>
>>>>> I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>>>> 1.0.4 and I haven't noticed any performance issues. By "multiple
>>>>> assignment feature" do you mean speculative execution
>>>>> (mapred.map.tasks.speculative.execution and
>>>>> mapred.reduce.tasks.speculative.execution)?
>>>>>
>>>>> On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>>>>
>>>>>> Problem solved, but worth warning others about.
>>>>>>
>>>>>> Before the upgrade the reducers for the terasort process had been
>>>>>> evenly distributed around the cluster - one per task tracker in turn,
>>>>>> looping around the cluster until all tasks were allocated. After the
>>>>>> upgrade all reduce tasks had been submitted to a small number of task
>>>>>> trackers - submit tasks until the task tracker slots were full and
>>>>>> then move onto the next task tracker. Skewing the reducers like this
>>>>>> quite clearly hit the benchmark performance.
>>>>>>
>>>>>> The reason for this turns out to be the fair scheduler rewrite
>>>>>> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>>>>> of the assign multiple property. Previously this property caused a
>>>>>> single map and a single reduce task to be allocated in a task tracker
>>>>>> heartbeat (rather than the default of a map or a reduce). After the
>>>>>> upgrade it allocates as many tasks as there are available task slots.
>>>>>> Turning off the multiple assignment feature returned the terasort to
>>>>>> its pre-upgrade performance.
>>>>>>
>>>>>> I can see potential benefits to this change and need to think through
>>>>>> the consequences for real-world applications (though in practice
>>>>>> we're likely to move away from the fair scheduler due to
>>>>>> MAPREDUCE-4451). Investigating this has been a pain, so to warn other
>>>>>> users is there anywhere central that can be used to record upgrade
>>>>>> gotchas like this?
>>>>>>
>>>>>> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayaye...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>>>>>> have hit performance problems. Before the upgrade a 15TB terasort
>>>>>>> took about 45 minutes, afterwards it takes just over an hour.
>>>>>>> Looking in more detail it appears the shuffle phase has increased
>>>>>>> from 20 minutes to 40 minutes. Does anyone have any thoughts about
>>>>>>> what ...
>>
>> --
>> Harsh J