Added to the HBASE-2888 (Review all our metrics) issue. Thanks for the support, Lars!
Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Mon, Nov 22, 2010 at 11:05 AM, Lars George <[email protected]> wrote:
> Hi Alex,
>
> We once had a class called Historian that would log the splits etc. It
> was abandoned as it was not implemented right. Currently we only have
> those logs, and I personally agree with you that this is not good for
> the average (or even advanced) user. In the future we may be able to
> use Coprocessors to handle region split info being written somewhere,
> but for now there is a void.
>
> I 100% agree with you that this info should be in a metric, as it is
> vital information about what the system is doing. All the core devs are
> used to reading the logs directly, which I think is OK for them but
> madness otherwise. Please have a look whether there is a JIRA for this
> and, if not, create a new one so that we can improve this for everyone.
> I am strongly for this sort of addition and will help you get it
> committed if you can provide a patch.
>
> Lars
>
> On Nov 22, 2010, at 9:38, Alex Baranau <[email protected]> wrote:
>
> > Hello Lars,
> >
> >> But ideally you would do a post mortem on the master and slave logs
> >> for Hadoop and HBase, since that would give you better insight into
> >> the events. For example, when did the system start to flush, when
> >> did it start compacting, when did the HDFS start to go slow?
> >
> > I wonder, does it make sense to expose these events through the HBase
> > web interface in an easily accessible way: e.g. a list of the points
> > in time of splits (for each region server), compaction/flush events
> > (or rates), etc.? This info is valuable for many users and, most
> > importantly, I believe it can affect their configuration-related
> > decisions, so showing the data on the web interface instead of making
> > users dig into logs makes sense and brings HBase a bit closer to
> > "easy to install & use". Thoughts?
> >
> > Btw, I found that in JMX we expose only flush- and compaction-related
> > data, nothing related to splits. Could you give a hint why? Also,
> > only times and sizes are exposed; a count (number of actions) would
> > probably be good to expose too: then one could see the
> > flush/compaction/split(?) rate and judge whether some configuration
> > value is set properly (e.g. hbase.hregion.memstore.flush.size).
> >
> > Thanks,
> > Alex Baranau
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
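(A quick way to see exactly which metrics a running region server exposes
is to dump its MBean attributes over JMX. Below is a minimal sketch; it
assumes remote JMX was enabled for the region server on port 10102 and
that the stats bean is registered as
"hadoop:service=RegionServer,name=RegionServerStatistics". Both the port
and the bean name depend on the setup, so treat them as placeholders.)

    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DumpRegionServerMetrics {
      public static void main(String[] args) throws Exception {
        // Host, port, and MBean name are assumptions; adjust to your setup.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
          MBeanServerConnection mbs = connector.getMBeanServerConnection();
          ObjectName name = new ObjectName(
              "hadoop:service=RegionServer,name=RegionServerStatistics");
          // Print every readable attribute the bean exposes. Per the
          // discussion above, expect flush/compaction times and sizes,
          // but nothing about splits and no counts.
          for (MBeanAttributeInfo attr : mbs.getMBeanInfo(name).getAttributes()) {
            if (attr.isReadable()) {
              System.out.println(attr.getName() + " = "
                  + mbs.getAttribute(name, attr.getName()));
            }
          }
        } finally {
          connector.close();
        }
      }
    }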
> > On Fri, Nov 19, 2010 at 5:16 PM, Lars George <[email protected]> wrote:
> >
> >> Hi Henning,
> >>
> >> And what you have seen is often difficult to explain. What I listed
> >> are the obvious contenders. But ideally you would do a post mortem
> >> on the master and slave logs for Hadoop and HBase, since that would
> >> give you better insight into the events. For example, when did the
> >> system start to flush, when did it start compacting, when did the
> >> HDFS start to go slow? And so on. One thing that I would highly
> >> recommend (if you haven't done so already) is getting graphing
> >> going. Use the built-in Ganglia support and you may be able to at
> >> least determine the overall load on the system and various metrics
> >> of Hadoop and HBase.
> >>
> >> Did you use normal Puts, or did you set the client to cache Puts and
> >> write them in bulk? See HTable.setWriteBufferSize() and
> >> HTable.setAutoFlush() for details (but please note that you then do
> >> need to call HTable.flushCommits() in the close() method of your
> >> mapper class). That will speed up writing data a lot.
> >>
> >> Lars
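(For the graphing part: in the 0.20 line both Hadoop and HBase report
through the Hadoop metrics framework, configured via
conf/hadoop-metrics.properties. A minimal sketch that pushes the "hbase"
and "jvm" metrics contexts to Ganglia might look as follows; the Ganglia
host and port are placeholders:)

    # conf/hadoop-metrics.properties (on the master and every region server)
    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    hbase.period=10
    hbase.servers=ganglia-host:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=ganglia-host:8649

(And to make the write-buffer advice concrete, here is a minimal sketch
against the 0.20-era client API. The table/column names, key format, and
buffer size are made up; in a real mapper the flushCommits() call would
live in close(), or cleanup() in the new MapReduce API:)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "testtable");
        // Stop sending one RPC per Put; collect Puts in the client-side buffer.
        table.setAutoFlush(false);
        table.setWriteBufferSize(12 * 1024 * 1024); // 12 MB, tune to taste
        for (long i = 0; i < 500000; i++) {
          // Zero-padded keys so lexicographic order matches numeric order.
          Put put = new Put(Bytes.toBytes(String.format("row-%012d", i)));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
              Bytes.toBytes("some value"));
          table.put(put); // buffered; a batch is sent when the buffer fills
        }
        // Crucial: push out whatever is still sitting in the buffer.
        table.flushCommits();
        table.close();
      }
    }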
> >> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <[email protected]> wrote:
> >>
> >>> Hi Lars,
> >>>
> >>> thanks. Yes, this is just the first test setup. Eventually the data
> >>> load will be significantly higher.
> >>>
> >>> At the moment (looking at the master after the run) the number of
> >>> regions is well distributed (684, 685, and 685 regions). The
> >>> overall HDFS use is ~700 GB (the replication factor is 3, btw).
> >>>
> >>> I will want to upgrade as soon as that makes sense. It seems there
> >>> is no release after 0.20.6 - that's why we are still with that one.
> >>>
> >>> When I do that run again, I will check the master UI and see how
> >>> things develop there. As for the current run: I did not expect to
> >>> get stable numbers early in the run. What looked suspicious was
> >>> that things got gradually worse until well into 30 hours after the
> >>> start of the run, and then even got better. An unexpected load
> >>> behavior for me (I would have expected early changes but then some
> >>> stable behavior up to the end).
> >>>
> >>> Thanks,
> >>> Henning
> >>>
> >>> On Friday, 19.11.2010, at 15:21 +0100, Lars George wrote:
> >>>
> >>>> Hi Henning,
> >>>>
> >>>> Could you look at the Master UI while doing the import? The issue
> >>>> with a cold bulk import is that you are hitting one region server
> >>>> initially, and while it is filling up its in-memory structures all
> >>>> is nice and dandy. Then you start to tax the server as it has to
> >>>> flush data out, and it becomes slower responding to the mappers
> >>>> still hammering it. Only after a while do the regions become large
> >>>> enough to get split, and load starts to spread across 2 machines,
> >>>> then 3. Eventually you have enough regions to handle your data,
> >>>> and you will see an average of the performance you can expect from
> >>>> a loaded cluster. For that reason we have added a bulk loading
> >>>> feature that helps build the region files externally and then swap
> >>>> them in.
> >>>>
> >>>> When you check the UI you can actually see this behavior, as the
> >>>> operations per second (ops) are bound to one server initially.
> >>>> Well, it could be two, as one of them also has to serve META. If
> >>>> that is the same machine, then you are penalized twice.
> >>>>
> >>>> In addition, you start to run into minor compaction load while
> >>>> HBase tries to do housekeeping during your load.
> >>>>
> >>>> With 0.89 you could pre-split the regions into what you see
> >>>> eventually when your job is complete. Please use the UI to check
> >>>> and let us know how many regions you end up with in total (out of
> >>>> interest, mainly). If you had done that before the import, then
> >>>> the load would be split right from the start.
> >>>>
> >>>> In general, 0.89 is much better performance-wise when it comes to
> >>>> bulk loads, so you may want to try it out as well. The 0.90 RC is
> >>>> up, so a release is imminent, and that saves you from having to
> >>>> upgrade again soon. Also, 0.90 is the first release with Hadoop's
> >>>> append fix, so you do not lose any data from wonky server
> >>>> behavior.
> >>>>
> >>>> And to wrap this up, 3 data nodes is not too great. If you ask
> >>>> anyone with a serious production-type setup, you will see 10
> >>>> machines and more; I'd say 20-30 machines and up. Some would say
> >>>> "use MySQL for this little data", but that is not fair given that
> >>>> we do not know what your targets are. The bottom line is, you will
> >>>> see issues (like slowness) with 3 nodes that 8 or 10 nodes will
> >>>> never show.
> >>>>
> >>>> HTH,
> >>>> Lars
> >>>>
> >>>> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <[email protected]> wrote:
> >>>>
> >>>>> We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data
> >>>>> nodes (12 GB RAM, 1.5 TB disk each) and one master node (24 GB,
> >>>>> 1.5 TB). We store a relatively simple table in HBase (1 column
> >>>>> family, 5 columns, row key about 100 chars).
> >>>>>
> >>>>> In order to better understand the load behavior, I wanted to put
> >>>>> 5*10^8 rows into that table. I wrote an M/R job that uses a split
> >>>>> InputFormat to split the 5*10^8 logical row keys (essentially
> >>>>> just counting from 0 to 5*10^8-1) into 1000 chunks of 500000 keys
> >>>>> each, and then lets the map tasks do the actual job of writing
> >>>>> the corresponding rows (with some random column values) into
> >>>>> HBase.
> >>>>>
> >>>>> So there are 1000 map tasks and no reducer. Each task writes
> >>>>> 500000 rows into HBase. We have 6 mapper slots, i.e. 24 mappers
> >>>>> running in parallel.
> >>>>>
> >>>>> The whole job runs for approx. 48 hours. Initially the map tasks
> >>>>> need around 30 min each. After a while things take longer and
> >>>>> longer, eventually reaching > 2 h. It tops out around the 850th
> >>>>> task, after which things speed up again, improving to about
> >>>>> 48 min in the end, until completed.
> >>>>>
> >>>>> These are all dedicated machines and there is nothing else
> >>>>> running. The map tasks have 200 MB of heap, and when checking
> >>>>> with vmstat in between I cannot observe any swapping.
> >>>>>
> >>>>> Also, on the master it seems that heap utilization is not at the
> >>>>> limit, and there is no swapping either. All Hadoop and HBase
> >>>>> processes have 1 GB of heap.
> >>>>>
> >>>>> Any idea what would cause the strong variation (or degradation)
> >>>>> of write performance? Is there a way of finding out where the
> >>>>> time gets lost?
> >>>>>
> >>>>> Thanks,
> >>>>> Henning
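PS: for reference, pre-splitting a table as Lars suggests could look
roughly like the sketch below. It relies on the createTable(desc,
splitKeys) overload of HBaseAdmin available in the 0.89/0.90 client (not
in 0.20), and it assumes row keys that are zero-padded decimal counters
over [0, 5*10^8), as in this load test, so that evenly spaced
lexicographic split points can be computed from the numeric range. The
table and family names are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("cf"));
        // 99 split points -> the table starts life with 100 regions
        // instead of 1. Zero-padding keeps lexicographic key order
        // identical to numeric order, so the regions are evenly loaded.
        final long totalRows = 500000000L;
        final int regions = 100;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
          splits[i - 1] = Bytes.toBytes(
              String.format("row-%012d", i * (totalRows / regions)));
        }
        admin.createTable(desc, splits);
      }
    }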
