Regarding du -sk, take a look here https://issues.apache.org/jira/browse/HADOOP-9884
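(For context on why those scans hurt: the DataNode periodically computes per-volume usage by recursively statting every block file, which on a volume with millions of blocks generates heavy metadata I/O, whereas a `df`-style query is a single filesystem call. A minimal illustrative sketch of the cost difference — the function names and paths here are illustrative, not HDFS internals:)

```python
import os
import shutil

def du_style(path):
    """Recursively stat every file under path, like `du -sk` does.
    Cost grows with the number of files -- on a DataNode volume with
    millions of block files this is a lot of metadata I/O."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may disappear mid-scan
    return total

def df_style(path):
    """One statvfs call, like `df`: constant time regardless of file count,
    at the price of reporting whole-filesystem rather than per-tree usage."""
    return shutil.disk_usage(path).used

# du_style("/data/12/dfs/dn") walks every block file under the tree;
# df_style("/data/12") answers from filesystem counters immediately.
```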
I can hardly wait for this one to be fixed, too.

On Tue, Mar 21, 2017 at 4:09 PM Hef <[email protected]> wrote:

> There were several curious things we observed:
> On the region servers, there were abnormally more reads than writes:
>
> Device:  tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
> sda      608.00   6552.00    0.00       6552     0
> sdb      345.00   2692.00    78868.00   2692     78868
> sdc      406.00   14548.00   63960.00   14548    63960
> sdd      2.00     0.00       32.00      0        32
> sde      62.00    8764.00    0.00       8764     0
> sdf      498.00   11100.00   32.00      11100    32
> sdg      2080.00  11712.00   0.00       11712    0
> sdh      109.00   5072.00    0.00       5072     0
> sdi      158.00   4.00       32228.00   4        32228
> sdj      43.00    5648.00    32.00      5648     32
> sdk      255.00   3784.00    0.00       3784     0
> sdl      86.00    1412.00    9176.00    1412     9176
>
> In the CDH region server dashboard, the average disk IOPS for writes was
> stable at 735/s, while reads rose from 900/s to 5000/s every 5 minutes.
>
> iotop showed the following processes were doing the most I/O:
>
> 6447 be/4 hdfs 2.70 M/s    0.00 B/s 0.00 % 94.54 % du -sk /data/12/dfs/dn/curre~632-10.1.1.100-1457937043486
> 6023 be/4 hdfs 2.54 M/s    0.00 B/s 0.00 % 92.14 % du -sk /data/9/dfs/dn/curren~632-10.1.1.100-1457937043486
> 6186 be/4 hdfs 1379.58 K/s 0.00 B/s 0.00 % 90.78 % du -sk /data/11/dfs/dn/curre~632-10.1.1.100-1457937043486
>
> What was all this reading for? And what are those du -sk processes? Could
> this be a reason for the slow write throughput?
>
> On Tue, Mar 21, 2017 at 7:48 PM, Hef <[email protected]> wrote:
>
> > Hi guys,
> > Thanks for all your hints.
> > Let me summarize the tuning I have done these past days.
> >
> > Initially, before tuning, the HBase cluster handled an average write
> > rate of 400k tps (600k tps at max). The total network TX throughput
> > from clients (aggregated across multiple servers) to the RegionServers
> > was about 300Mb/s on average.
> >
> > I took the following tuning steps:
> >
> > 1. Optimized the HBase schema for our table, reducing the cell size by
> > 40%.
> > Result: failed, tps not noticeably increased.
> >
> > 2. Recreated the table with a more evenly distributed pre-split
> > keyspace.
> > Result: failed, tps not noticeably increased.
> >
> > 3. Adjusted the RS GC strategy.
> > Before:
> > -XX:+UseParNewGC
> > -XX:+UseConcMarkSweepGC
> > -XX:CMSInitiatingOccupancyFraction=70
> > -XX:+CMSParallelRemarkEnabled
> > -Xmx100g
> > -Xms100g
> > -Xmn20g
> >
> > After:
> > -XX:+UseG1GC
> > -XX:+UnlockExperimentalVMOptions
> > -XX:MaxGCPauseMillis=50
> > -XX:-OmitStackTraceInFastThrow
> > -XX:ParallelGCThreads=18
> > -XX:+ParallelRefProcEnabled
> > -XX:+PerfDisableSharedMem
> > -XX:-ResizePLAB
> > -XX:G1NewSizePercent=8
> > -Xms100G -Xmx100G
> > -XX:MaxTenuringThreshold=1
> > -XX:G1HeapWastePercent=10
> > -XX:G1MixedGCCountTarget=16
> > -XX:G1HeapRegionSize=32M
> >
> > Result: success. GC pause time was reduced and tps increased by at
> > least 10%.
> >
> > 4. Upgraded to CDH 5.9.1 / HBase 1.2, and updated the client lib to
> > HBase 1.2 as well.
> > Result: success.
> >   1. Total client TX throughput rose to 700Mb/s.
> >   2. HBase write tps rose to 600k/s on average, 800k/s at max.
> >
> > 5. Other configurations:
> > hbase.hstore.compactionThreshold = 10
> > hbase.hstore.blockingStoreFiles = 300
> > hbase.hstore.compaction.max = 20
> > hbase.regionserver.thread.compaction.small = 30
> >
> > hbase.hregion.memstore.flush.size = 128
> > hbase.regionserver.global.memstore.lowerLimit = 0.3
> > hbase.regionserver.global.memstore.upperLimit = 0.7
> >
> > hbase.regionserver.maxlogs = 100
> > hbase.wal.regiongrouping.numgroups = 5
> > hbase.wal.provider = Multiple HDFS WAL
> >
> > Summary:
> > 1. HBase 1.2 does have better performance than 1.0.
> > 2. 300k/s tps per RegionServer still looks unsatisfying, as I can see
> > that CPU/network/IO/memory still have plenty of idle capacity.
> > Per RS:
> >   1. CPU 50% used (not sure why CPU usage is so high for only 300K
> >      write requests)
> >   2. JVM heap 40% used
> >   3. Total disk throughput over 12 HDDs: 91MB/s on write and 40MB/s
> >      on read
> >   4. Network in/out 560Mb/s on a 1G NIC
> >
> > Further questions:
> > Has anyone dealt with a similar heavy-write scenario like this?
> > How many concurrent writes can a RegionServer handle? Can anyone share
> > how much tps your RS can reach at max?
> >
> > Thanks
> > Hef
> >
> > On Sat, Mar 18, 2017 at 1:11 PM, Yu Li <[email protected]> wrote:
> >
> >> First please try out Stack's suggestions, all good ones.
> >>
> >> Some supplements: since all disks in use are HDDs with ordinary IO
> >> capability, it's important to limit the rate of big IO like flush and
> >> compaction. Try these features out:
> >> 1. HBASE-8329 <https://issues.apache.org/jira/browse/HBASE-8329>: Limit
> >>    compaction speed (available in 1.1.0+)
> >> 2. HBASE-14969 <https://issues.apache.org/jira/browse/HBASE-14969>: Add
> >>    throughput controller for flush (available in 1.3.0)
> >> 3. HBASE-10201 <https://issues.apache.org/jira/browse/HBASE-10201>: Per
> >>    column family flush (available in 1.1.0+)
> >>    * HBASE-14906 <https://issues.apache.org/jira/browse/HBASE-14906>:
> >>      Improvements on FlushLargeStoresPolicy (only available in 2.0, not
> >>      released yet)
> >>
> >> Also try out multiple WALs; we observed a ~20% write perf boost in
> >> production. See more details in the doc attached to this JIRA:
> >> - HBASE-14457 <https://issues.apache.org/jira/browse/HBASE-14457>:
> >>   Umbrella: Improve Multiple WAL for production usage
> >>
> >> And please note that if you decide to pick up a branch-1.1 release,
> >> make sure to use 1.1.3+, or you may hit a perf regression on writes;
> >> see HBASE-14460 <https://issues.apache.org/jira/browse/HBASE-14460>
> >> for more details.
> >>
> >> Hope this information helps.
> >>
> >> Best Regards,
> >> Yu
> >>
> >> On 18 March 2017 at 05:51, Vladimir Rodionov <[email protected]>
> >> wrote:
> >>
> >> > In my opinion, 1M/s input data will result in only 70MByte/s write
> >> >
> >> > Times 3 (default HDFS replication factor). Plus ...
> >> >
> >> > Do not forget about compaction read/write amplification. If you
> >> > flush 10 MB and your max region size is 10 GB, with the default min
> >> > files to compact (3), your amplification is 6-7. That gives us
> >> > 70 x 3 x 6 = 1260 MB/s read/write, or 210 MB/s read and write per RS
> >> > (210 MB/s reads and 210 MB/s writes).
> >> >
> >> > This IO load is way above sustainable.
> >> >
> >> > -Vlad
> >> >
> >> > On Fri, Mar 17, 2017 at 2:14 PM, Kevin O'Dell <[email protected]>
> >> > wrote:
> >> >
> >> > > Hey Hef,
> >> > >
> >> > > What is the memstore size setting (how much heap is it allowed)
> >> > > that you have on that cluster? What is your region count per node?
> >> > > Are you writing evenly across all those regions, or are only a few
> >> > > regions active per region server at a time? Can you paste the GC
> >> > > settings you are currently using?
> >> > >
> >> > > On Fri, Mar 17, 2017 at 3:30 PM, Stack <[email protected]> wrote:
> >> > >
> >> > > > On Fri, Mar 17, 2017 at 9:31 AM, Hef <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi group,
> >> > > > > I'm using HBase to store a large amount of time series data;
> >> > > > > the use case is much heavier on writes than reads. My
> >> > > > > application tops out at writing 600k requests per second and I
> >> > > > > can't tune it for better tps.
> >> > > > >
> >> > > > > Hardware:
> >> > > > > I have 6 RegionServers; each has 128G memory, 12 HDDs, and 2
> >> > > > > CPUs with 24 threads.
> >> > > > >
> >> > > > > Schema:
> >> > > > > The schema for this time series data is similar to OpenTSDB's:
> >> > > > > the data points of the same metric within an hour are stored
> >> > > > > in one row, so there can be a maximum of 3600 columns per row.
> >> > > > > A cell is about 70 bytes in size, including the rowkey, column
> >> > > > > qualifier, column family and value.
> >> > > > >
> >> > > > > HBase config:
> >> > > > > CDH 5.6, HBase 1.0.0
> >> > > >
> >> > > > Can you upgrade? There's a big diff between 1.2 and 1.0.
> >> > > >
> >> > > > > 100G memory for each RegionServer
> >> > > > > hbase.hstore.compactionThreshold = 50
> >> > > > > hbase.hstore.blockingStoreFiles = 100
> >> > > > > hbase.hregion.majorcompaction disabled
> >> > > > > hbase.client.write.buffer = 20MB
> >> > > > > hbase.regionserver.handler.count = 100
> >> > > >
> >> > > > Could try halving the handler count.
> >> > > >
> >> > > > > hbase.hregion.memstore.flush.size = 128MB
> >> > > >
> >> > > > Why are you flushing? If it is because you are hitting this
> >> > > > flush limit, can you try upping it?
> >> > > >
> >> > > > > HBase Client:
> >> > > > > writes via BufferedMutator with 100000/batch
> >> > > > >
> >> > > > > Input volume:
> >> > > > > The input data throughput is more than 2 million/sec from
> >> > > > > Kafka.
> >> > > >
> >> > > > How is the distribution? Even over the keyspace?
> >> > > >
> >> > > > > My writer applications are distributed; however I scale them
> >> > > > > up, the total write throughput won't get larger than 600K/sec.
> >> > > >
> >> > > > Tell us more about this scaling up. How many writers?
> >> > > > > The servers have 20% CPU usage and 5.6 wa.
> >> > > >
> >> > > > 5.6 is high enough. Is the I/O spread over the disks?
> >> > > >
> >> > > > > GC doesn't look good though; it shows a lot of 10s+ pauses.
> >> > > >
> >> > > > What settings do you have?
> >> > > >
> >> > > > > In my opinion, 1M/s input data will result in only 70MByte/s
> >> > > > > of write throughput to the cluster, which is quite a small
> >> > > > > amount compared to the 6 region servers. The performance
> >> > > > > should not be as bad as this.
> >> > > > >
> >> > > > > Does anybody have an idea why the performance stops at 600K/s?
> >> > > > > Is there anything I have to tune to increase the HBase write
> >> > > > > throughput?
> >> > > >
> >> > > > If you double the clients writing, do you see an increase in
> >> > > > throughput?
> >> > > >
> >> > > > If you thread dump the servers, can you tell where they are held
> >> > > > up? Or whether they are doing any work at all, relatively?
> >> > > >
> >> > > > St.Ack
> >> > >
> >> > > --
> >> > > Kevin O'Dell
> >> > > Field Engineer
> >> > > 850-496-1298 | [email protected]
> >> > > @kevinrodell
> >> > > <http://www.rocana.com>
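(Editor's note: Vlad's back-of-envelope I/O estimate up-thread can be worked through explicitly. All figures below come from the thread itself: ~70-byte cells, a 1M records/s rate, the default HDFS replication factor of 3, the low end of his 6-7x compaction amplification estimate, and 6 RegionServers.)

```python
# Figures taken from the thread above.
records_per_sec = 1_000_000      # input rate used in Vlad's estimate
cell_size_bytes = 70             # ~70 bytes per cell incl. key/qualifier/value
replication = 3                  # default HDFS replication factor
compaction_amplification = 6     # low end of Vlad's 6-7 estimate
region_servers = 6

ingest_mb_s = records_per_sec * cell_size_bytes / 1_000_000
cluster_io_mb_s = ingest_mb_s * replication * compaction_amplification
per_rs_mb_s = cluster_io_mb_s / region_servers

print(f"raw ingest:       {ingest_mb_s:.0f} MB/s")      # 70 MB/s
print(f"cluster I/O:      {cluster_io_mb_s:.0f} MB/s")  # 1260 MB/s
print(f"per RegionServer: {per_rs_mb_s:.0f} MB/s")      # 210 MB/s
```

Divided over the 12 HDDs per RS described in the thread, 210 MB/s works out to roughly 17-18 MB/s of sustained sequential-plus-seek traffic per spindle before any read load, which is why Vlad characterizes the load as unsustainable on ordinary HDDs.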
