Jie,

'Recent' was over 11 months ago. :-)
Unfortunately the software licence requires that most of us 'negotiate' a commercial use license before we trial the software in a commercial environment: http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as clarified here: http://www.cs.duke.edu/starfish/previous.html

Under that last URL was a note that you were soon to distribute the source code under the Apache Software License. Last time I asked, the reply was that this would not happen. Perhaps it is time to update your web pages or your license arrangements. :-)

I like what I saw on my home 'cluster' but haven't had the time to sort out licensing to trial this in a commercial environment.

Chris

On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
> Hi Jon,
>
> Thanks for sharing these insights! Can't agree with you more!
>
> Recently we released a tool called the Starfish Hadoop Log Analyzer for analyzing job histories. I believe it can quickly point out this reduce problem you hit!
>
> http://www.cs.duke.edu/starfish/release.html
>
> Jie
>
> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <jayaye...@gmail.com> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are things you need to have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the cluster performance, e.g. terasort, gridmix and some operational jobs. Terasort by itself isn't very realistic as a cluster test, but it's nice and simple to run and is good for regression testing things after a change.
> >
> > After the upgrade the intention was to run the same tests and show that the performance hadn't degraded (improved would have been nice, but not worse was the minimum). When we ran the terasort we found that performance was about 50% worse - execution time had gone from 40 minutes to 60 minutes. As I've said, terasort doesn't provide a realistic view of operational performance, but this showed that something major had changed and we needed to understand it before going further. So how to go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve. It's very easy to say performance isn't good enough, but performance can always be better, so you need to know what's realistic and at what point you're going to stop tuning things. I had a previous baseline that I was trying to match, so I knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem is. We had the full job history from the before and after jobs, and comparing these we saw that map performance was fairly consistent, as were the reduce sort and reduce phases. The problem was with the shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes afterwards. The important thing here is to make sure you've got as much information as possible. If we'd just kept the overall job time then there would have been a lot more areas to look at, but knowing the problem was with shuffle allowed me to focus effort in this area.
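For anyone who wants to do the same kind of before/after comparison, the per-phase timings Jon describes can be pulled from the job history with the stock history viewer. A minimal sketch only, assuming a Hadoop 1.x client and hypothetical benchmark output directories whose _logs/history folders are still in place:

    # Dump the analysed history (per-phase average and worst-case times)
    # for the benchmark run from before and after the upgrade.
    hadoop job -history /benchmarks/terasort-pre-upgrade  > pre.txt
    hadoop job -history /benchmarks/terasort-post-upgrade > post.txt

    # The "Analysis" section of each report breaks out map, shuffle, sort
    # and reduce times, so a simple diff shows which phase has moved.
    diff pre.txt post.txt | less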
> >
> > So what had changed in the shuffle that may have slowed things down? The first thing we thought of was that we'd moved from a tarball deployment to using the RPM, so what effect might this have had on things? Our operational configuration compresses the map output, and in the past we've had problems with Java compression libraries being used rather than native ones, and this has affected performance. We knew the RPM deployment had moved the native library, so we spent some time confirming to ourselves that these were being used correctly (but this turned out not to be the problem). We then spent time doing some process and server profiling - using dstat to look at the server bottlenecks and jstack/jmap to check what the task tracker and reduce processes were doing. Although not directly relevant to this particular problem, doing this was useful just to get my head around what Hadoop is doing at various points of the process.
> >
> > The next bit was one place where I got lucky - I happened to be logged onto one of the worker nodes when a test job was running and I noticed that there weren't any reduce tasks running on the server. This was odd as we'd submitted more reducers than we have servers, so I'd expected at least one task to be running on each server. Checking the job tracker log file, it turned out that since the upgrade the job tracker had been submitting reduce tasks to only 10% of the available nodes. A different 10% each time the job was run, so clearly the individual task trackers were working OK but there was something odd going on with the task allocation. Checking the job tracker log file showed that before the upgrade tasks had been fairly evenly distributed, so something had changed. After that it was a case of digging around the source code to find out which classes were available for task allocation and what inside them had changed. This can be quite daunting, but if you're comfortable with Java then it's just a case of following the calls through the code. Once I found the cause it was just a case of working out what my options were for working around it (in this case turning off the multiple assignment option - I can work out whether I want to turn it back on in slower time).
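For anyone wanting to make the same check on their own cluster, the reduce-task spread can be pulled straight out of the JobTracker log. A rough sketch only - the log path and the exact message format are assumptions for a Hadoop 1.x JobTracker and may differ on your install:

    # Count how many reduce task attempts were handed to each TaskTracker.
    # An even spread gives similar counts per tracker; a heavy skew towards
    # a handful of trackers matches the behaviour described above.
    grep "Adding task (REDUCE)" /var/log/hadoop/hadoop-*-jobtracker-*.log \
      | grep -o "tracker_[^' ]*" \
      | sort | uniq -c | sort -rn | head -20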
> >
> > Where I think we got very lucky is that we hit this problem. The configuration we use for the terasort has just over 1 reducer per worker node rather than maxing out the available reducer slots. This decision was made several years ago and I can't remember the reasons for it. If we'd been using a larger number of reducers then the number of worker nodes in use would have been similar regardless of the allocation algorithm, and so the performance would have looked similar before and after the upgrade. We would have hit this problem eventually, but probably not until we started running user jobs, and by then it would be too late to do the intrusive investigations that were possible now.
> >
> > Hope this has been useful.
> >
> > Regards,
> > Jon
> >
> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
> >> Jon:
> >>
> >> This is interesting and helpful! How did you figure out the cause? And how much time did you spend? Could you share some experience of performance diagnosis?
> >>
> >> Jie
> >>
> >> On Tuesday, November 27, 2012, Harsh J wrote:
> >>> Hi Amit,
> >>>
> >>> The default scheduler is FIFO, and may not work for all forms of workloads. Read about the multiple schedulers available to see if they have features that may benefit your workload:
> >>>
> >>> Capacity Scheduler: http://hadoop.apache.org/docs/stable/capacity_scheduler.html
> >>> FairScheduler: http://hadoop.apache.org/docs/stable/fair_scheduler.html
> >>>
> >>> While there's a good overlap of features between them, there are a few differences that set them apart and make them each useful for different use-cases. If I had to summarize some such differences: FairScheduler is better suited to SLA-style job execution situations due to its preemptive features (which make it useful in user and service mix scenarios), while CapacityScheduler provides manual-resource-request oriented scheduling for odd jobs with high memory workloads, etc. (which makes it useful for running certain specific kinds of jobs alongside the regular ones).
> >>>
> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
> >>> > So this is a FairScheduler problem? We are using the default Hadoop scheduler. Is there a reason to use the Fair Scheduler if most of the time we don't have more than 4 jobs running simultaneously?
> >>> >
> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> Hi Amit,
> >>> >>
> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler property. It is true by default, which works well for most workloads, if not for benchmark-style workloads. I would not usually trust that as a base perf. measure of everything that comes out of an upgrade.
> >>> >>
> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>> >>
> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> >>> >> > Hi Jon,
> >>> >> >
> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4 and I haven't noticed any performance issues. By "multiple assignment feature" do you mean speculative execution (mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution)?
> >>> >> >
> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <jayaye...@gmail.com> wrote:
> >>> >> >> Problem solved, but worth warning others about.
> >>> >> >>
> >>> >> >> Before the upgrade the reducers for the terasort process had been evenly distributed around the cluster - one per task tracker in turn, looping around the cluster until all tasks were allocated. After the upgrade all reduce tasks had been submitted to a small number of task trackers - submit tasks until the task tracker slots were full and then move onto the next task tracker. Skewing the reducers like this quite clearly hit the benchmark performance.
> >>> >> >>
> >>> >> >> The reason for this turns out to be the fair scheduler rewrite (MAPREDUCE-2981), which appears to have subtly modified the behaviour of the assign multiple property. Previously this property caused a single map and a single reduce task to be allocated in a task tracker heartbeat (rather than the default of a map or a reduce). After the upgrade it allocates as many tasks as there are available task slots. Turning off the multiple assignment feature returned the terasort to its pre-upgrade performance.
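For reference, anyone wanting to try the same workaround can do it through the JobTracker's mapred-site.xml, since the property Harsh names (default true) is read by the Hadoop 1.x FairScheduler. A minimal sketch; the config path shown is an assumption:

    # Check whether the property is currently set (it defaults to true).
    grep -A1 "mapred.fairscheduler.assignmultiple" /etc/hadoop/conf/mapred-site.xml

    # To turn multiple assignment off, add the following property to the
    # JobTracker's mapred-site.xml and restart the JobTracker:
    #   <property>
    #     <name>mapred.fairscheduler.assignmultiple</name>
    #     <value>false</value>
    #   </property>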
> >>> >> >>
> >>> >> >> I can see potential benefits to this change and need to think through the consequences for real-world applications (though in practice we're likely to move away from the fair scheduler due to MAPREDUCE-4451). Investigating this has been a pain, so to warn other users: is there anywhere central that can be used to record upgrade gotchas like this?
> >>> >> >>
> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayaye...@gmail.com> wrote:
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have hit performance problems. Before the upgrade a 15TB terasort took about 45 minutes, afterwards it takes just over an hour. Looking in more detail it appears the shuffle phase has increased from 20 minutes to 40 minutes. Does anyone have any thoughts about what'
> >>>
> >>> --
> >>> Harsh J
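For anyone who wants to reproduce the kind of baseline being compared in this thread, the terasort benchmark ships in the stock examples jar. A minimal sketch, assuming a Hadoop 1.x install; the row count and HDFS paths are only placeholders (teragen rows are 100 bytes each, so 10^9 rows is roughly 100 GB - scale up for a 15TB run):

    # Generate the input data, sort it, then check the output is ordered.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000000 /benchmarks/tera-in
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort /benchmarks/tera-in /benchmarks/tera-out
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report

    # Keeping the same teragen size and job configuration for every run is
    # what makes the before/after timings comparable.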