Added to the HBASE-2888 (Review all our metrics) issue. Thanks for the support, Lars!
Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Mon, Nov 22, 2010 at 11:05 AM, Lars George <[email protected]> wrote:
> Hi Alex,
>
> We once had a class called Historian that would log the splits etc. It
> was abandoned as it was not implemented right. Currently we only have
> those logs, and I personally agree with you that this is not good for
> the average (or even advanced) user. In the future we may be able to
> use Coprocessors to handle region split info being written somewhere,
> but for now there is a void.
>
> I 100% agree with you that this info should be in a metric, as it is
> vital information about what the system is doing. All the core devs are
> used to reading the logs directly, which I think is OK for them but
> madness otherwise. Please have a look whether there is a JIRA for this
> and, if not, create a new one so that we can improve this for everyone.
> I am strongly for this sort of addition and will help you get it
> committed if you can provide a patch.
>
> Lars
>
> On Nov 22, 2010, at 9:38, Alex Baranau <[email protected]> wrote:
>
> > Hello Lars,
> >
> >> But ideally you would do a post mortem on the master and slave logs
> >> for Hadoop and HBase, since that would give you better insight into
> >> the events. For example, when did the system start to flush, when
> >> did it start compacting, when did the HDFS start to go slow?
> >
> > I wonder, does it make sense to expose these events through the HBase
> > web interface in an easily accessible way: e.g. a list of the points
> > in time of splits (for each region server), compaction/flush events
> > (or rates), etc.? This info is valuable for many users and, most
> > importantly, I believe it can affect their configuration-related
> > decisions, so showing the data on the web interface instead of making
> > users dig into logs makes sense and brings HBase a bit closer to
> > "easy to install & use". Thoughts?
> >
> > Btw, I found that in JMX we expose only flush- and compaction-related
> > data, nothing related to splits. Could you give a hint why? Also,
> > only times and sizes are exposed; a count (number of actions) would
> > probably be good to expose too: then one could see the
> > flush/compaction/split(?) rate and judge whether some configuration
> > value is set properly (e.g. hbase.hregion.memstore.flush.size).
> >
> > Thanks,
> > Alex Baranau
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
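(A quick way to see exactly which metrics a running region server exposes
is to dump its MBean attributes over JMX. Below is a minimal sketch; it
assumes remote JMX was enabled for the region server on port 10102 and
that the stats bean is registered as
"hadoop:service=RegionServer,name=RegionServerStatistics". Both the port
and the bean name depend on the setup, so treat them as placeholders.)

    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DumpRegionServerMetrics {
      public static void main(String[] args) throws Exception {
        // Host, port, and MBean name are assumptions; adjust to your setup.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
          MBeanServerConnection mbs = connector.getMBeanServerConnection();
          ObjectName name = new ObjectName(
              "hadoop:service=RegionServer,name=RegionServerStatistics");
          // Print every readable attribute the bean exposes. Per the
          // discussion above, expect flush/compaction times and sizes,
          // but nothing about splits and no counts.
          for (MBeanAttributeInfo attr : mbs.getMBeanInfo(name).getAttributes()) {
            if (attr.isReadable()) {
              System.out.println(attr.getName() + " = "
                  + mbs.getAttribute(name, attr.getName()));
            }
          }
        } finally {
          connector.close();
        }
      }
    }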
> > On Fri, Nov 19, 2010 at 5:16 PM, Lars George <[email protected]> wrote:
> >
> >> Hi Henning,
> >>
> >> And what you have seen is often difficult to explain. What I listed
> >> are the obvious contenders. But ideally you would do a post mortem
> >> on the master and slave logs for Hadoop and HBase, since that would
> >> give you better insight into the events. For example, when did the
> >> system start to flush, when did it start compacting, when did the
> >> HDFS start to go slow? And so on. One thing that I would highly
> >> recommend (if you haven't done so already) is getting graphing
> >> going. Use the built-in Ganglia support and you may be able to at
> >> least determine the overall load on the system and various metrics
> >> of Hadoop and HBase.
> >>
> >> Did you use normal Puts, or did you set the client to cache Puts and
> >> write them in bulk? See HTable.setWriteBufferSize() and
> >> HTable.setAutoFlush() for details (but please note that you then do
> >> need to call HTable.flushCommits() in the close() method of your
> >> mapper class). That will speed up writing data a lot.
> >>
> >> Lars
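(For the graphing part: in the 0.20 line both Hadoop and HBase report
through the Hadoop metrics framework, configured via
conf/hadoop-metrics.properties. A minimal sketch that pushes the "hbase"
and "jvm" metrics contexts to Ganglia might look as follows; the Ganglia
host and port are placeholders:)

    # conf/hadoop-metrics.properties (on the master and every region server)
    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    hbase.period=10
    hbase.servers=ganglia-host:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=ganglia-host:8649

(And to make the write-buffer advice concrete, here is a minimal sketch
against the 0.20-era client API. The table/column names, key format, and
buffer size are made up; in a real mapper the flushCommits() call would
live in close(), or cleanup() in the new MapReduce API:)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "testtable");
        // Stop sending one RPC per Put; collect Puts in the client-side buffer.
        table.setAutoFlush(false);
        table.setWriteBufferSize(12 * 1024 * 1024); // 12 MB, tune to taste
        for (long i = 0; i < 500000; i++) {
          // Zero-padded keys so lexicographic order matches numeric order.
          Put put = new Put(Bytes.toBytes(String.format("row-%012d", i)));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
              Bytes.toBytes("some value"));
          table.put(put); // buffered; a batch is sent when the buffer fills
        }
        // Crucial: push out whatever is still sitting in the buffer.
        table.flushCommits();
        table.close();
      }
    }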
> >> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <[email protected]> wrote:
> >>
> >>> Hi Lars,
> >>>
> >>> thanks. Yes, this is just the first test setup. Eventually the data
> >>> load will be significantly higher.
> >>>
> >>> At the moment (looking at the master after the run) the number of
> >>> regions is well distributed (684, 685, and 685 regions). The
> >>> overall HDFS use is ~700 GB (the replication factor is 3, btw).
> >>>
> >>> I will want to upgrade as soon as that makes sense. It seems there
> >>> is no release after 0.20.6 - that's why we are still with that one.
> >>>
> >>> When I do that run again, I will check the master UI and see how
> >>> things develop there. As for the current run: I did not expect to
> >>> get stable numbers early in the run. What looked suspicious was
> >>> that things got gradually worse until well into 30 hours after the
> >>> start of the run, and then even got better. An unexpected load
> >>> behavior for me (I would have expected early changes but then some
> >>> stable behavior up to the end).
> >>>
> >>> Thanks,
> >>> Henning
> >>>
> >>> On Friday, 19.11.2010, at 15:21 +0100, Lars George wrote:
> >>>
> >>>> Hi Henning,
> >>>>
> >>>> Could you look at the Master UI while doing the import? The issue
> >>>> with a cold bulk import is that you are hitting one region server
> >>>> initially, and while it is filling up its in-memory structures all
> >>>> is nice and dandy. Then you start to tax the server as it has to
> >>>> flush data out, and it becomes slower responding to the mappers
> >>>> still hammering it. Only after a while do the regions become large
> >>>> enough to get split, and load starts to spread across 2 machines,
> >>>> then 3. Eventually you have enough regions to handle your data,
> >>>> and you will see an average of the performance you can expect from
> >>>> a loaded cluster. For that reason we have added a bulk loading
> >>>> feature that helps build the region files externally and then swap
> >>>> them in.
> >>>>
> >>>> When you check the UI you can actually see this behavior, as the
> >>>> operations per second (ops) are bound to one server initially.
> >>>> Well, it could be two, as one of them also has to serve META. If
> >>>> that is the same machine, then you are penalized twice.
> >>>>
> >>>> In addition, you start to run into minor compaction load while
> >>>> HBase tries to do housekeeping during your load.
> >>>>
> >>>> With 0.89 you could pre-split the regions into what you see
> >>>> eventually when your job is complete. Please use the UI to check
> >>>> and let us know how many regions you end up with in total (out of
> >>>> interest, mainly). If you had done that before the import, then
> >>>> the load would be split right from the start.
> >>>>
> >>>> In general, 0.89 is much better performance-wise when it comes to
> >>>> bulk loads, so you may want to try it out as well. The 0.90 RC is
> >>>> up, so a release is imminent, and that saves you from having to
> >>>> upgrade again soon. Also, 0.90 is the first release with Hadoop's
> >>>> append fix, so you do not lose any data from wonky server
> >>>> behavior.
> >>>>
> >>>> And to wrap this up, 3 data nodes is not too great. If you ask
> >>>> anyone with a serious production-type setup, you will see 10
> >>>> machines and more; I'd say 20-30 machines and up. Some would say
> >>>> "use MySQL for this little data", but that is not fair given that
> >>>> we do not know what your targets are. The bottom line is, you will
> >>>> see issues (like slowness) with 3 nodes that 8 or 10 nodes will
> >>>> never show.
> >>>>
> >>>> HTH,
> >>>> Lars
> >>>>
> >>>> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <[email protected]> wrote:
> >>>>
> >>>>> We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data
> >>>>> nodes (12 GB RAM, 1.5 TB disk each) and one master node (24 GB,
> >>>>> 1.5 TB). We store a relatively simple table in HBase (1 column
> >>>>> family, 5 columns, row key about 100 chars).
> >>>>>
> >>>>> In order to better understand the load behavior, I wanted to put
> >>>>> 5*10^8 rows into that table. I wrote an M/R job that uses a split
> >>>>> InputFormat to split the 5*10^8 logical row keys (essentially
> >>>>> just counting from 0 to 5*10^8-1) into 1000 chunks of 500000 keys
> >>>>> each, and then lets the map tasks do the actual job of writing
> >>>>> the corresponding rows (with some random column values) into
> >>>>> HBase.
> >>>>>
> >>>>> So there are 1000 map tasks and no reducer. Each task writes
> >>>>> 500000 rows into HBase. We have 6 mapper slots, i.e. 24 mappers
> >>>>> running in parallel.
> >>>>>
> >>>>> The whole job runs for approx. 48 hours. Initially the map tasks
> >>>>> need around 30 min each. After a while things take longer and
> >>>>> longer, eventually reaching > 2 h. It tops out around the 850th
> >>>>> task, after which things speed up again, improving to about
> >>>>> 48 min in the end, until completed.
> >>>>>
> >>>>> These are all dedicated machines and there is nothing else
> >>>>> running. The map tasks have 200 MB of heap, and when checking
> >>>>> with vmstat in between I cannot observe any swapping.
> >>>>>
> >>>>> Also, on the master it seems that heap utilization is not at the
> >>>>> limit, and there is no swapping either. All Hadoop and HBase
> >>>>> processes have 1 GB of heap.
> >>>>>
> >>>>> Any idea what would cause the strong variation (or degradation)
> >>>>> of write performance? Is there a way of finding out where the
> >>>>> time gets lost?
> >>>>>
> >>>>> Thanks,
> >>>>> Henning
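PS: for reference, pre-splitting a table as Lars suggests could look
roughly like the sketch below. It relies on the createTable(desc,
splitKeys) overload of HBaseAdmin available in the 0.89/0.90 client (not
in 0.20), and it assumes row keys that are zero-padded decimal counters
over [0, 5*10^8), as in this load test, so that evenly spaced
lexicographic split points can be computed from the numeric range. The
table and family names are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("cf"));
        // 99 split points -> the table starts life with 100 regions
        // instead of 1. Zero-padding keeps lexicographic key order
        // identical to numeric order, so the regions are evenly loaded.
        final long totalRows = 500000000L;
        final int regions = 100;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
          splits[i - 1] = Bytes.toBytes(
              String.format("row-%012d", i * (totalRows / regions)));
        }
        admin.createTable(desc, splits);
      }
    }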
