Hello Lars,

> But ideally you would do a post mortem on the master and slave logs
> for Hadoop and HBase, since that would give you a better insight into
> the events. For example, when did the system start to flush, when did
> it start compacting, when did the HDFS start to go slow?

I wonder, does it make sense to expose these events in the HBase web
interface in an easily accessible way: e.g. a list of time points of
splits (per region server), compaction/flush events (or rates), etc.?
This info looks valuable for many users and, most importantly, I believe
it can affect their configuration-related decisions, so showing the data
in the web interface instead of making users dig into the logs would move
HBase a bit further towards "easy to install & use". Thoughts?

Btw, I found that in JMX we expose only flush- and compaction-related
data, nothing related to splits. Could you give a hint why? Also, we
expose only time and size; it would probably be good to expose a count
(number of actions) as well: that way one could see the flush, compaction
and split(?) rates and judge whether some configuration option is set
properly (e.g. hbase.hregion.memstore.flush.size).
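To make the idea concrete, something like the following is what I have
in mind for reading those numbers programmatically. This is an untested
sketch: the host/port and the MBean/attribute names are guesses on my
part and should be checked (e.g. with jconsole) against what a region
server actually registers.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Untested sketch: poll a region server's flush/compaction metrics over
// JMX. The host, port and the MBean/attribute names are assumptions;
// verify them with jconsole against your actual region server.
public class RegionServerMetricsProbe {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbsc = connector.getMBeanServerConnection();
      // Hypothetical object name for the region server statistics bean.
      ObjectName rs = new ObjectName(
          "hadoop:service=RegionServer,name=RegionServerStatistics");
      System.out.println("flushTime = " + mbsc.getAttribute(rs, "flushTime"));
      System.out.println("flushSize = " + mbsc.getAttribute(rs, "flushSize"));
      // A "flushCount" attribute is what I am proposing above; it does
      // not exist today, but polling it would give a flush rate.
    } finally {
      connector.close();
    }
  }
}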
Thanks,
Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Fri, Nov 19, 2010 at 5:16 PM, Lars George <[email protected]> wrote:
> Hi Henning,
>
> And what you have seen is often difficult to explain. What I listed
> are the obvious contenders. But ideally you would do a post mortem on
> the master and slave logs for Hadoop and HBase, since that would give
> you a better insight into the events. For example, when did the system
> start to flush, when did it start compacting, when did the HDFS start
> to go slow? And so on. One thing that I would highly recommend (if you
> haven't done so already) is to get graphing going. Use the built-in
> Ganglia support and you may be able to at least determine the overall
> load on the system and various metrics of Hadoop and HBase.
>
> Did you use normal Puts, or did you set it to cache Puts and write
> them in bulk? See HTable.setWriteBufferSize() and
> HTable.setAutoFlush() for details (but please note that you then need
> to call HTable.flushCommits() in the close() method of your mapper
> class). That will help a lot in speeding up writing the data.
>
> Lars
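For reference, the buffered-write pattern Lars describes would look
roughly like this in a mapper (a sketch against the 0.20.x client API;
the table name, column family/qualifier and buffer size are placeholders
I made up):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a write-heavy mapper that buffers Puts client-side instead
// of doing one RPC per row. Table and column names are placeholders.
public class BufferedPutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "mytable");
      table.setAutoFlush(false);                  // buffer Puts client-side
      table.setWriteBufferSize(12 * 1024 * 1024); // send every ~12 MB
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text row,
      OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
      throws IOException {
    Put put = new Put(Bytes.toBytes(row.toString()));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("value"));
    table.put(put); // lands in the write buffer, not on the wire
  }

  public void close() throws IOException {
    table.flushCommits(); // push whatever is still sitting in the buffer
  }
}

The close() is the important part: without the final flushCommits(),
the tail of each task's writes would silently stay in the client buffer.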
> On Fri, Nov 19, 2010 at 3:43 PM, Henning Blohm <[email protected]> wrote:
> > Hi Lars,
> >
> > thanks. Yes, this is just the first test setup. Eventually the data
> > load will be significantly higher.
> >
> > At the moment (looking at the master after the run) the number of
> > regions is well distributed (684, 685, 685 regions). The overall
> > HDFS use is ~700G (the replication factor is 3, btw).
> >
> > I will want to upgrade as soon as that makes sense. It seems there
> > is no "release" after 0.20.6 - that's why we are still with that one.
> >
> > When I do that run again, I will check the master UI and see how
> > things develop there. As for the current run: I do not expect to get
> > stable numbers early in the run. What looked suspicious was that
> > things got gradually worse until well into 30 hours after the start
> > of the run, and only then got better. An unexpected load behavior
> > for me (I would have expected early changes but then some stable
> > behavior up to the end).
> >
> > Thanks,
> > Henning
> >
> > On Friday, 19.11.2010, 15:21 +0100, Lars George wrote:
> >
> >> Hi Henning,
> >>
> >> Could you look at the Master UI while doing the import? The issue
> >> with a cold bulk import is that you are hitting one region server
> >> initially, and while it is filling up its in-memory structures all
> >> is nice and dandy. Then you start to tax the server as it has to
> >> flush data out, and it becomes slower responding to the mappers
> >> still hammering it. Only after a while do the regions become large
> >> enough to get split, and the load starts to spread across 2
> >> machines, then 3. Eventually you have enough regions to handle your
> >> data and you will see an average of the performance you could
> >> expect from a loaded cluster. For that reason we have added a bulk
> >> loading feature that helps build the region files externally and
> >> then swaps them in.
> >>
> >> When you check the UI you can actually see this behavior, as the
> >> operations per second (ops) are bound to one server initially.
> >> Well, it could be two, as one of them also has to serve META. If
> >> that is the same machine then you are penalized twice.
> >>
> >> In addition, you start to run into minor compaction load while
> >> HBase tries to do housekeeping during your load.
> >>
> >> With 0.89 you could pre-split the regions into what you see
> >> eventually when your job is complete. Please use the UI to check
> >> and let us know how many regions you end up with in total (out of
> >> interest mainly). If that is done before the import then the load
> >> is split right from the start.
> >>
> >> In general, 0.89 is much better performance-wise when it comes to
> >> bulk loads, so you may want to try it out as well. The 0.90 RC is
> >> up, so a release is imminent and saves you from having to upgrade
> >> again soon. Also, 0.90 is the first release with Hadoop's append
> >> fix, so that you do not lose any data from wonky server behavior.
> >>
> >> And to wrap this up, 3 data nodes is not too great. If you ask
> >> anyone with a serious production-type setup you will see 10
> >> machines and more, I'd say 20-30 machines and up. Some would say
> >> "Use MySQL for this little data", but that is not fair given that
> >> we do not know what your targets are. The bottom line is, you will
> >> see issues (like slowness) with 3 nodes that 8 or 10 nodes will
> >> never show.
> >>
> >> HTH,
> >> Lars
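The pre-splitting Lars mentions could be done along these lines. This is
a sketch assuming the 0.89/0.90 HBaseAdmin.createTable(desc, splitKeys)
API; the table name, family, region count and the 0-padded key scheme
are all made up and would have to match the real key distribution:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create the table pre-split into regions before the bulk load,
// so writes spread across servers from the start. Split points below
// assume 0-padded decimal row keys covering 0 .. 5*10^8-1.
public class PreSplitTable {
  public static void main(String[] args) throws IOException {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("f"));

    int numRegions = 100;        // pick based on the expected final count
    long totalKeys = 500000000L; // 5*10^8 rows, as in the test job
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      // 0-padding makes lexicographic order match numeric order
      long boundary = totalKeys / numRegions * i;
      splits[i - 1] = Bytes.toBytes(String.format("%09d", boundary));
    }
    admin.createTable(desc, splits); // regions exist before the first Put
  }
}

Judging by the roughly 2000 regions the run ended up with across the
three servers, numRegions would presumably want to be of that order
rather than 100.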
> >> On Fri, Nov 19, 2010 at 2:09 PM, Henning Blohm <[email protected]> wrote:
> >> > We have a Hadoop 0.20.2 + HBase 0.20.6 setup with three data
> >> > nodes (12GB, 1.5TB each) and one master node (24GB, 1.5TB). We
> >> > store a relatively simple table in HBase (1 column family, 5
> >> > columns, row key of about 100 chars).
> >> >
> >> > In order to better understand the load behavior, I wanted to put
> >> > 5*10^8 rows into that table. I wrote an M/R job that uses a Split
> >> > Input Format to split the 5*10^8 logical row keys (essentially
> >> > just counting from 0 to 5*10^8-1) into 1000 chunks of 500000
> >> > keys, and then lets the map tasks do the actual job of writing
> >> > the corresponding rows (with some random column values) into
> >> > HBase.
> >> >
> >> > So there are 1000 map tasks and no reducer. Each task writes
> >> > 500000 rows into HBase. We have 6 mapper slots per node, i.e. 24
> >> > mappers running in parallel.
> >> >
> >> > The whole job runs for approx. 48 hours. Initially the map tasks
> >> > need around 30 min. each. After a while things take longer and
> >> > longer, eventually reaching > 2h. It tops out around the 850th
> >> > task, after which things speed up again, improving to about 48
> >> > min. at the end, until completed.
> >> >
> >> > These are all dedicated machines and there is nothing else
> >> > running. The map tasks have 200m heap, and when checking with
> >> > vmstat in between I cannot observe any swapping.
> >> >
> >> > Also, on the master it seems that heap utilization is not at the
> >> > limit, and there is no swapping either. All Hadoop and HBase
> >> > processes have 1G heap.
> >> >
> >> > Any idea what would cause the strong variation (or degradation)
> >> > of write performance? Is there a way of finding out where time
> >> > gets lost?
> >> >
> >> > Thanks,
> >> > Henning
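As an aside, the "Split Input Format" Henning describes presumably looks
something like the sketch below: purely synthetic input splits over a
logical key range, with no input files. All class and method names here
are invented; his actual implementation may well differ.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a "split input format": 1000 logical key ranges of 500000
// keys each, no input files. Names are invented for illustration.
public class KeyRangeInputFormat implements InputFormat<LongWritable, Text> {

  public static class RangeSplit implements InputSplit {
    long start, count;
    public RangeSplit() {}
    public RangeSplit(long start, long count) {
      this.start = start; this.count = count;
    }
    public long getLength() { return count; }
    public String[] getLocations() { return new String[0]; } // no locality
    public void write(DataOutput out) throws IOException {
      out.writeLong(start); out.writeLong(count);
    }
    public void readFields(DataInput in) throws IOException {
      start = in.readLong(); count = in.readLong();
    }
  }

  public InputSplit[] getSplits(JobConf job, int numSplits) {
    int chunks = 1000;
    long perChunk = 500000L; // 1000 * 500000 = 5*10^8 keys total
    InputSplit[] splits = new InputSplit[chunks];
    for (int i = 0; i < chunks; i++) {
      splits[i] = new RangeSplit(i * perChunk, perChunk);
    }
    return splits;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) {
    final RangeSplit r = (RangeSplit) split;
    return new RecordReader<LongWritable, Text>() {
      private long next = r.start;
      public boolean next(LongWritable key, Text value) {
        if (next >= r.start + r.count) return false;
        key.set(next);
        // The real job would derive its ~100 char row key from the counter.
        value.set(Long.toString(next));
        next++;
        return true;
      }
      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return next - r.start; }
      public float getProgress() { return (next - r.start) / (float) r.count; }
      public void close() {}
    };
  }
}

Each of the 1000 RangeSplits becomes one map task, and the record reader
hands that task its 500000 consecutive logical keys to turn into rows.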
