Regarding du -sk, take a look here https://issues.apache.org/jira/browse/HADOOP-9884
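(For context on why those scans hurt: the DataNode periodically computes per-volume usage by recursively statting every block file, which on a volume with millions of blocks generates heavy metadata I/O, whereas a `df`-style query is a single filesystem call. A minimal illustrative sketch of the cost difference — the function names and paths here are illustrative, not HDFS internals:)

```python
import os
import shutil

def du_style(path):
    """Recursively stat every file under path, like `du -sk` does.
    Cost grows with the number of files -- on a DataNode volume with
    millions of block files this is a lot of metadata I/O."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may disappear mid-scan
    return total

def df_style(path):
    """One statvfs call, like `df`: constant time regardless of file count,
    at the price of reporting whole-filesystem rather than per-tree usage."""
    return shutil.disk_usage(path).used

# du_style("/data/12/dfs/dn") walks every block file under the tree;
# df_style("/data/12") answers from filesystem counters immediately.
```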
I can hardly wait for this one to be fixed, too.

On Tue, Mar 21, 2017 at 4:09 PM Hef <[email protected]> wrote:

> There were several curious things we observed:
> On the region servers, there were abnormally more reads than writes:
>
> Device:  tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
> sda      608.00   6552.00    0.00       6552     0
> sdb      345.00   2692.00    78868.00   2692     78868
> sdc      406.00   14548.00   63960.00   14548    63960
> sdd      2.00     0.00       32.00      0        32
> sde      62.00    8764.00    0.00       8764     0
> sdf      498.00   11100.00   32.00      11100    32
> sdg      2080.00  11712.00   0.00       11712    0
> sdh      109.00   5072.00    0.00       5072     0
> sdi      158.00   4.00       32228.00   4        32228
> sdj      43.00    5648.00    32.00      5648     32
> sdk      255.00   3784.00    0.00       3784     0
> sdl      86.00    1412.00    9176.00    1412     9176
>
> In the CDH region server dashboard, the average disk IOPS for writes was
> stable at 735/s, while reads rose from 900/s to 5000/s every 5 minutes.
>
> iotop showed the following processes were doing the most I/O:
>
> 6447 be/4 hdfs 2.70 M/s    0.00 B/s 0.00 % 94.54 % du -sk /data/12/dfs/dn/curre~632-10.1.1.100-1457937043486
> 6023 be/4 hdfs 2.54 M/s    0.00 B/s 0.00 % 92.14 % du -sk /data/9/dfs/dn/curren~632-10.1.1.100-1457937043486
> 6186 be/4 hdfs 1379.58 K/s 0.00 B/s 0.00 % 90.78 % du -sk /data/11/dfs/dn/curre~632-10.1.1.100-1457937043486
>
> What was all this reading for? And what are those du -sk processes? Could
> this be a reason for the slow write throughput?
>
> On Tue, Mar 21, 2017 at 7:48 PM, Hef <[email protected]> wrote:
>
> > Hi guys,
> > Thanks for all your hints.
> > Let me summarize the tuning I have done these past days.
> >
> > Initially, before tuning, the HBase cluster handled an average write
> > rate of 400k tps (600k tps at max). The total network TX throughput
> > from clients (aggregated across multiple servers) to the RegionServers
> > was about 300Mb/s on average.
> >
> > I took the following tuning steps:
> >
> > 1. Optimized the HBase schema for our table, reducing the cell size by
> > 40%.
> > Result: failed, tps not noticeably increased.
> >
> > 2. Recreated the table with a more evenly distributed pre-split
> > keyspace.
> > Result: failed, tps not noticeably increased.
> >
> > 3. Adjusted the RS GC strategy.
> > Before:
> > -XX:+UseParNewGC
> > -XX:+UseConcMarkSweepGC
> > -XX:CMSInitiatingOccupancyFraction=70
> > -XX:+CMSParallelRemarkEnabled
> > -Xmx100g
> > -Xms100g
> > -Xmn20g
> >
> > After:
> > -XX:+UseG1GC
> > -XX:+UnlockExperimentalVMOptions
> > -XX:MaxGCPauseMillis=50
> > -XX:-OmitStackTraceInFastThrow
> > -XX:ParallelGCThreads=18
> > -XX:+ParallelRefProcEnabled
> > -XX:+PerfDisableSharedMem
> > -XX:-ResizePLAB
> > -XX:G1NewSizePercent=8
> > -Xms100G -Xmx100G
> > -XX:MaxTenuringThreshold=1
> > -XX:G1HeapWastePercent=10
> > -XX:G1MixedGCCountTarget=16
> > -XX:G1HeapRegionSize=32M
> >
> > Result: success. GC pause time was reduced and tps increased by at
> > least 10%.
> >
> > 4. Upgraded to CDH 5.9.1 / HBase 1.2, and updated the client lib to
> > HBase 1.2 as well.
> > Result: success.
> >   1. Total client TX throughput rose to 700Mb/s.
> >   2. HBase write tps rose to 600k/s on average, 800k/s at max.
> >
> > 5. Other configurations:
> > hbase.hstore.compactionThreshold = 10
> > hbase.hstore.blockingStoreFiles = 300
> > hbase.hstore.compaction.max = 20
> > hbase.regionserver.thread.compaction.small = 30
> >
> > hbase.hregion.memstore.flush.size = 128
> > hbase.regionserver.global.memstore.lowerLimit = 0.3
> > hbase.regionserver.global.memstore.upperLimit = 0.7
> >
> > hbase.regionserver.maxlogs = 100
> > hbase.wal.regiongrouping.numgroups = 5
> > hbase.wal.provider = Multiple HDFS WAL
> >
> > Summary:
> > 1. HBase 1.2 does have better performance than 1.0.
> > 2. 300k/s tps per RegionServer still looks unsatisfying, as I can see
> > that CPU/network/IO/memory still have plenty of idle capacity.
> > Per RS:
> >   1. CPU 50% used (not sure why CPU usage is so high for only 300K
> >      write requests)
> >   2. JVM heap 40% used
> >   3. Total disk throughput over 12 HDDs: 91MB/s on write and 40MB/s
> >      on read
> >   4. Network in/out 560Mb/s on a 1G NIC
> >
> > Further questions:
> > Has anyone dealt with a similar heavy-write scenario like this?
> > How many concurrent writes can a RegionServer handle? Can anyone share
> > how much tps your RS can reach at max?
> >
> > Thanks
> > Hef
> >
> > On Sat, Mar 18, 2017 at 1:11 PM, Yu Li <[email protected]> wrote:
> >
> >> First please try out Stack's suggestions, all good ones.
> >>
> >> Some supplements: since all disks in use are HDDs with ordinary IO
> >> capability, it's important to limit the rate of big IO like flush and
> >> compaction. Try these features out:
> >> 1. HBASE-8329 <https://issues.apache.org/jira/browse/HBASE-8329>: Limit
> >>    compaction speed (available in 1.1.0+)
> >> 2. HBASE-14969 <https://issues.apache.org/jira/browse/HBASE-14969>: Add
> >>    throughput controller for flush (available in 1.3.0)
> >> 3. HBASE-10201 <https://issues.apache.org/jira/browse/HBASE-10201>: Per
> >>    column family flush (available in 1.1.0+)
> >>    * HBASE-14906 <https://issues.apache.org/jira/browse/HBASE-14906>:
> >>      Improvements on FlushLargeStoresPolicy (only available in 2.0, not
> >>      released yet)
> >>
> >> Also try out multiple WALs; we observed a ~20% write perf boost in
> >> production. See more details in the doc attached to this JIRA:
> >> - HBASE-14457 <https://issues.apache.org/jira/browse/HBASE-14457>:
> >>   Umbrella: Improve Multiple WAL for production usage
> >>
> >> And please note that if you decide to pick up a branch-1.1 release,
> >> make sure to use 1.1.3+, or you may hit a perf regression on writes;
> >> see HBASE-14460 <https://issues.apache.org/jira/browse/HBASE-14460>
> >> for more details.
> >>
> >> Hope this information helps.
> >>
> >> Best Regards,
> >> Yu
> >>
> >> On 18 March 2017 at 05:51, Vladimir Rodionov <[email protected]>
> >> wrote:
> >>
> >> > In my opinion, 1M/s input data will result in only 70MByte/s write
> >> >
> >> > Times 3 (default HDFS replication factor). Plus ...
> >> >
> >> > Do not forget about compaction read/write amplification. If you
> >> > flush 10 MB and your max region size is 10 GB, with the default min
> >> > files to compact (3), your amplification is 6-7. That gives us
> >> > 70 x 3 x 6 = 1260 MB/s read/write, or 210 MB/s read and write per RS
> >> > (210 MB/s reads and 210 MB/s writes).
> >> >
> >> > This IO load is way above sustainable.
> >> >
> >> > -Vlad
> >> >
> >> > On Fri, Mar 17, 2017 at 2:14 PM, Kevin O'Dell <[email protected]>
> >> > wrote:
> >> >
> >> > > Hey Hef,
> >> > >
> >> > > What is the memstore size setting (how much heap is it allowed)
> >> > > that you have on that cluster? What is your region count per node?
> >> > > Are you writing evenly across all those regions, or are only a few
> >> > > regions active per region server at a time? Can you paste the GC
> >> > > settings you are currently using?
> >> > >
> >> > > On Fri, Mar 17, 2017 at 3:30 PM, Stack <[email protected]> wrote:
> >> > >
> >> > > > On Fri, Mar 17, 2017 at 9:31 AM, Hef <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi group,
> >> > > > > I'm using HBase to store a large amount of time series data;
> >> > > > > the use case is much heavier on writes than reads. My
> >> > > > > application tops out at writing 600k requests per second and I
> >> > > > > can't tune it for better tps.
> >> > > > >
> >> > > > > Hardware:
> >> > > > > I have 6 RegionServers; each has 128G memory, 12 HDDs, and 2
> >> > > > > CPUs with 24 threads.
> >> > > > >
> >> > > > > Schema:
> >> > > > > The schema for this time series data is similar to OpenTSDB's:
> >> > > > > the data points of the same metric within an hour are stored
> >> > > > > in one row, so there can be a maximum of 3600 columns per row.
> >> > > > > A cell is about 70 bytes in size, including the rowkey, column
> >> > > > > qualifier, column family and value.
> >> > > > >
> >> > > > > HBase config:
> >> > > > > CDH 5.6, HBase 1.0.0
> >> > > >
> >> > > > Can you upgrade? There's a big diff between 1.2 and 1.0.
> >> > > >
> >> > > > > 100G memory for each RegionServer
> >> > > > > hbase.hstore.compactionThreshold = 50
> >> > > > > hbase.hstore.blockingStoreFiles = 100
> >> > > > > hbase.hregion.majorcompaction disabled
> >> > > > > hbase.client.write.buffer = 20MB
> >> > > > > hbase.regionserver.handler.count = 100
> >> > > >
> >> > > > Could try halving the handler count.
> >> > > >
> >> > > > > hbase.hregion.memstore.flush.size = 128MB
> >> > > >
> >> > > > Why are you flushing? If it is because you are hitting this
> >> > > > flush limit, can you try upping it?
> >> > > >
> >> > > > > HBase Client:
> >> > > > > writes via BufferedMutator with 100000/batch
> >> > > > >
> >> > > > > Input volume:
> >> > > > > The input data throughput is more than 2 million/sec from
> >> > > > > Kafka.
> >> > > >
> >> > > > How is the distribution? Even over the keyspace?
> >> > > >
> >> > > > > My writer applications are distributed; however I scale them
> >> > > > > up, the total write throughput won't get larger than 600K/sec.
> >> > > >
> >> > > > Tell us more about this scaling up. How many writers?
> >> > > > > The servers have 20% CPU usage and 5.6 wa.
> >> > > >
> >> > > > 5.6 is high enough. Is the I/O spread over the disks?
> >> > > >
> >> > > > > GC doesn't look good though; it shows a lot of 10s+ pauses.
> >> > > >
> >> > > > What settings do you have?
> >> > > >
> >> > > > > In my opinion, 1M/s input data will result in only 70MByte/s
> >> > > > > of write throughput to the cluster, which is quite a small
> >> > > > > amount compared to the 6 region servers. The performance
> >> > > > > should not be as bad as this.
> >> > > > >
> >> > > > > Does anybody have an idea why the performance stops at 600K/s?
> >> > > > > Is there anything I have to tune to increase the HBase write
> >> > > > > throughput?
> >> > > >
> >> > > > If you double the clients writing, do you see an increase in
> >> > > > throughput?
> >> > > >
> >> > > > If you thread dump the servers, can you tell where they are held
> >> > > > up? Or whether they are doing any work at all, relatively?
> >> > > >
> >> > > > St.Ack
> >> > >
> >> > > --
> >> > > Kevin O'Dell
> >> > > Field Engineer
> >> > > 850-496-1298 | [email protected]
> >> > > @kevinrodell
> >> > > <http://www.rocana.com>
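(Editor's note: Vlad's back-of-envelope I/O estimate up-thread can be worked through explicitly. All figures below come from the thread itself: ~70-byte cells, a 1M records/s rate, the default HDFS replication factor of 3, the low end of his 6-7x compaction amplification estimate, and 6 RegionServers.)

```python
# Figures taken from the thread above.
records_per_sec = 1_000_000      # input rate used in Vlad's estimate
cell_size_bytes = 70             # ~70 bytes per cell incl. key/qualifier/value
replication = 3                  # default HDFS replication factor
compaction_amplification = 6     # low end of Vlad's 6-7 estimate
region_servers = 6

ingest_mb_s = records_per_sec * cell_size_bytes / 1_000_000
cluster_io_mb_s = ingest_mb_s * replication * compaction_amplification
per_rs_mb_s = cluster_io_mb_s / region_servers

print(f"raw ingest:       {ingest_mb_s:.0f} MB/s")      # 70 MB/s
print(f"cluster I/O:      {cluster_io_mb_s:.0f} MB/s")  # 1260 MB/s
print(f"per RegionServer: {per_rs_mb_s:.0f} MB/s")      # 210 MB/s
```

Divided over the 12 HDDs per RS described in the thread, 210 MB/s works out to roughly 17-18 MB/s of sustained sequential-plus-seek traffic per spindle before any read load, which is why Vlad characterizes the load as unsustainable on ordinary HDDs.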
