Jie,

'Recent' was over 11 months ago. :-)
Unfortunately the software licence requires that most of us 'negotiate' a commercial use license before we trial the software in a commercial environment: http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as clarified here: http://www.cs.duke.edu/starfish/previous.html

Under that last URL was a note that you were soon to distribute the source code under the Apache Software License. Last time I asked, the reply was that this would not happen. Perhaps it is time to update your web pages or your license arrangements. :-)

I like what I saw on my home 'cluster' but haven't had the time to sort out licensing to trial this in a commercial environment.

Chris

On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
> Hi Jon,
>
> Thanks for sharing these insights! Can't agree with you more!
>
> Recently we released a tool called the Starfish Hadoop Log Analyzer for analyzing job histories. I believe it can quickly point out this reduce problem you hit!
>
> http://www.cs.duke.edu/starfish/release.html
>
> Jie
>
> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <jayaye...@gmail.com> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix and some operational jobs. Terasort by itself isn't very realistic as a cluster test, but it's nice and simple to run and is good for regression testing things after a change.
> >
> > After the upgrade the intention was to run the same tests and show that the performance hadn't degraded (improved would have been nice, but not worse was the minimum). When we ran the terasort we found that performance was about 50% worse - execution time had gone from 40 minutes to 60 minutes. As I've said, terasort doesn't provide a realistic view of operational performance, but this showed that something major had changed and we needed to understand it before going further. So how to go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve. It's very easy to say performance isn't good enough, but performance can always be better, so you need to know what's realistic and at what point you're going to stop tuning things. I had a previous baseline that I was trying to match, so I knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem is. We had the full job history from the before and after jobs, and comparing these we saw that map performance was fairly consistent, as were the reduce sort and reduce phases. The problem was with the shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The important thing here is to make sure you've got as much information as possible. If we'd just kept the overall job time then there would have been a lot more areas to look at, but knowing the problem was with shuffle allowed me to focus effort in this area.
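For anyone who wants to do the same kind of before/after comparison, the per-phase timings Jon describes can be pulled from the job history with the stock history viewer. A minimal sketch only, assuming a Hadoop 1.x client and hypothetical benchmark output directories whose _logs/history folders are still in place:

    # Dump the analysed history (per-phase average and worst-case times)
    # for the benchmark run from before and after the upgrade.
    hadoop job -history /benchmarks/terasort-pre-upgrade  > pre.txt
    hadoop job -history /benchmarks/terasort-post-upgrade > post.txt

    # The "Analysis" section of each report breaks out map, shuffle, sort
    # and reduce times, so a simple diff shows which phase has moved.
    diff pre.txt post.txt | less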
> >
> > So what had changed in the shuffle that may have slowed things down? The first thing we thought of was that we'd moved from a tarball deployment to using the RPM, so what effect might this have had on things? Our operational configuration compresses the map output, and in the past we've had problems with Java compression libraries being used rather than native ones, and this has affected performance. We knew the RPM deployment had moved the native library, so we spent some time confirming to ourselves that these were being used correctly (but this turned out not to be the problem). We then spent time doing some process and server profiling - using dstat to look at the server bottlenecks and jstack/jmap to check what the task tracker and reduce processes were doing. Although not directly relevant to this particular problem, doing this was useful just to get my head around what Hadoop is doing at various points of the process.
> >
> > The next bit was one place where I got lucky - I happened to be logged onto one of the worker nodes when a test job was running and I noticed that there weren't any reduce tasks running on the server. This was odd as we'd submitted more reducers than we have servers, so I'd expected at least one task to be running on each server. Checking the job tracker log file, it turned out that since the upgrade the job tracker had been submitting reduce tasks to only 10% of the available nodes. A different 10% each time the job was run, so clearly the individual task trackers were working OK but there was something odd going on with the task allocation. Checking the job tracker log file showed that before the upgrade tasks had been fairly evenly distributed, so something had changed. After that it was a case of digging around the source code to find out which classes were available for task allocation and what inside them had changed. This can be quite daunting, but if you're comfortable with Java then it's just a case of following the calls through the code. Once I found the cause it was just a case of working out what my options were for working around it (in this case turning off the multiple assignment option - I can work out whether I want to turn it back on in slower time).
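For anyone wanting to make the same check on their own cluster, the reduce-task spread can be pulled straight out of the JobTracker log. A rough sketch only - the log path and the exact message format are assumptions for a Hadoop 1.x JobTracker and may differ on your install:

    # Count how many reduce task attempts were handed to each TaskTracker.
    # An even spread gives similar counts per tracker; a heavy skew towards
    # a handful of trackers matches the behaviour described above.
    grep "Adding task (REDUCE)" /var/log/hadoop/hadoop-*-jobtracker-*.log \
      | grep -o "tracker_[^' ]*" \
      | sort | uniq -c | sort -rn | head -20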
> >
> > Where I think we got very lucky is that we hit this problem. The configuration we use for the terasort has just over 1 reducer per worker node rather than maxing out the available reducer slots. This decision was made several years ago and I can't remember the reasons for it. If we'd been using a larger number of reducers then the number of worker nodes in use would have been similar regardless of the allocation algorithm, and so the performance would have looked similar before and after the upgrade. We would have hit this problem eventually, but probably not until we started running user jobs, and by then it would be too late to do the intrusive investigations that were possible now.
> >
> > Hope this has been useful.
> >
> > Regards,
> > Jon
> >
> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
> >> Jon:
> >>
> >> This is interesting and helpful! How did you figure out the cause? And how much time did you spend? Could you share some experience of performance diagnosis?
> >>
> >> Jie
> >>
> >> On Tuesday, November 27, 2012, Harsh J wrote:
> >>> Hi Amit,
> >>>
> >>> The default scheduler is FIFO, and may not work for all forms of workloads. Read about the multiple schedulers available to see if they have features that may benefit your workload:
> >>>
> >>> Capacity Scheduler: http://hadoop.apache.org/docs/stable/capacity_scheduler.html
> >>> FairScheduler: http://hadoop.apache.org/docs/stable/fair_scheduler.html
> >>>
> >>> While there's a good overlap of features between them, there are a few differences that set them apart and make them each useful for different use-cases. If I had to summarize some such differences: FairScheduler is better suited to SLA-style job execution situations due to its preemptive features (which make it useful in user and service mix scenarios), while CapacityScheduler provides manual-resource-request oriented scheduling for odd jobs with high memory workloads, etc. (which makes it useful for running certain specific kinds of jobs alongside the regular ones).
> >>>
> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
> >>> > So this is a FairScheduler problem? We are using the default Hadoop scheduler. Is there a reason to use the Fair Scheduler if most of the time we don't have more than 4 jobs running simultaneously?
> >>> >
> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> Hi Amit,
> >>> >>
> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler property. It is true by default, which works well for most workloads, if not for benchmark-style workloads. I would not usually trust that as a base perf. measure of everything that comes out of an upgrade.
> >>> >>
> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>> >>
> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> >>> >> > Hi Jon,
> >>> >> >
> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4 and I haven't noticed any performance issues. By "multiple assignment feature" do you mean speculative execution (mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution)?
> >>> >> >
> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayaye...@gmail.com> wrote:
> >>> >> >> Problem solved, but worth warning others about.
> >>> >> >>
> >>> >> >> Before the upgrade the reducers for the terasort process had been evenly distributed around the cluster - one per task tracker in turn, looping around the cluster until all tasks were allocated. After the upgrade all reduce tasks had been submitted to a small number of task trackers - submit tasks until the task tracker slots were full and then move onto the next task tracker. Skewing the reducers like this quite clearly hit the benchmark performance.
> >>> >> >>
> >>> >> >> The reason for this turns out to be the fair scheduler rewrite (MAPREDUCE-2981), which appears to have subtly modified the behaviour of the assign multiple property. Previously this property caused a single map and a single reduce task to be allocated in a task tracker heartbeat (rather than the default of a map or a reduce). After the upgrade it allocates as many tasks as there are available task slots. Turning off the multiple assignment feature returned the terasort to its pre-upgrade performance.
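For reference, anyone wanting to try the same workaround can do it through the JobTracker's mapred-site.xml, since the property Harsh names (default true) is read by the Hadoop 1.x FairScheduler. A minimal sketch; the config path shown is an assumption:

    # Check whether the property is currently set (it defaults to true).
    grep -A1 "mapred.fairscheduler.assignmultiple" /etc/hadoop/conf/mapred-site.xml

    # To turn multiple assignment off, add the following property to the
    # JobTracker's mapred-site.xml and restart the JobTracker:
    #   <property>
    #     <name>mapred.fairscheduler.assignmultiple</name>
    #     <value>false</value>
    #   </property>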
> >>> >> >>
> >>> >> >> I can see potential benefits to this change and need to think through the consequences for real-world applications (though in practice we're likely to move away from the fair scheduler due to MAPREDUCE-4451). Investigating this has been a pain, so to warn other users: is there anywhere central that can be used to record upgrade gotchas like this?
> >>> >> >>
> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayaye...@gmail.com> wrote:
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have hit performance problems. Before the upgrade a 15TB terasort took about 45 minutes, afterwards it takes just over an hour. Looking in more detail it appears the shuffle phase has increased from 20 minutes to 40 minutes. Does anyone have any thoughts about what'
> >>>
> >>> --
> >>> Harsh J
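For anyone who wants to reproduce the kind of baseline being compared in this thread, the terasort benchmark ships in the stock examples jar. A minimal sketch, assuming a Hadoop 1.x install; the row count and HDFS paths are only placeholders (teragen rows are 100 bytes each, so 10^9 rows is roughly 100 GB - scale up for a 15TB run):

    # Generate the input data, sort it, then check the output is ordered.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000000 /benchmarks/tera-in
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort /benchmarks/tera-in /benchmarks/tera-out
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report

    # Keeping the same teragen size and job configuration for every run is
    # what makes the before/after timings comparable.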