Hmm... Yes. Was worth a try :) Should've checked and I even wrote that part of the code.
I have no good explanation then, and also no good suggestion about how to improve this.

________________________________
From: Asaf Mesika <[email protected]>
To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
Sent: Friday, June 21, 2013 5:50 AM
Subject: Re: Replication not suited for intensive write applications?

On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <[email protected]> wrote:

> Another thought...
>
> I assume you only write to a single table, right? How large are your rows on average?
>
I'm writing to 2 tables. Avg row size for the 1st table is 1500 bytes, and the second is around 800 bytes.

> Replication will send 64mb blocks by default (or 25000 edits, whatever is smaller). The default HTable buffer is 2mb only, so the slave RS receiving a block of edits (assuming it is a full block) has to do 32 rounds of splitting the edits per region in order to apply them.
>
In ReplicationSink.java (0.94.6) I see that HTable.batch() is used, which writes directly to the RS without buffering?

    private void batch(byte[] tableName, List<Row> rows) throws IOException {
      if (rows.isEmpty()) {
        return;
      }
      HTableInterface table = null;
      try {
        table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
        table.batch(rows);
        this.metrics.appliedOpsRate.inc(rows.size());
      } catch (InterruptedException ix) {
        throw new IOException(ix);
      } finally {
        if (table != null) {
          table.close();
        }
      }
    }

> There is no setting specifically targeted at the buffer size for replication, but maybe you could increase "hbase.client.write.buffer" to 64mb (67108864) on the slave cluster and see whether that makes a difference. If it does we can (1) add a setting to control the ReplicationSink HTable's buffer size, or (2) just have it match the replication buffer size "replication.source.size.capacity".
>
> -- Lars
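(Illustration, not part of the original thread: a minimal sketch of the experiment suggested above, assuming the 0.94 client API; the table, column family, and row names are placeholders. Note that, as observed above, HTable.batch() bypasses this buffer, so the setting would only matter if the sink used buffered puts.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteBufferExperiment {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // One client buffer per full 64 MB replication block (default is 2 MB).
        conf.setLong("hbase.client.write.buffer", 67108864L);

        HTable table = new HTable(conf, "test_table"); // placeholder table name
        table.setAutoFlush(false); // accumulate Puts client-side
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);       // buffered until the write buffer fills
        table.flushCommits(); // or flush explicitly
        table.close();
      }
    }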
> ________________________________
> From: lars hofhansl <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, June 21, 2013 1:48 AM
> Subject: Re: Replication not suited for intensive write applications?
>
> Thanks for checking... Interesting. So talking to 3 RSs as opposed to only 1 before had no effect on the throughput?
>
> Would be good to explore this a bit more.
> Since our RPC is not streaming, latency will affect throughput. In this case there is latency while all edits are shipped to the RS in the slave cluster and then extra latency when applying the edits there (which are likely not local to that RS). A true streaming API should be better. If that is the case compression *could* help (but that is a big if).
>
> The single thread shipping the edits to the slave should not be an issue as the edits are actually applied by the slave RS, which will use multiple threads to apply the edits in the local cluster.
>
> Also my first reply - upon re-reading it - sounded a bit rough, that was not intended.
>
> -- Lars
>
> ----- Original Message -----
> From: Asaf Mesika <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> Cc:
> Sent: Thursday, June 20, 2013 10:16 PM
> Subject: Re: Replication not suited for intensive write applications?
>
> Thanks for taking the time to answer!
> My answers are inline.
>
> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <[email protected]> wrote:
>
> > I see.
> >
> > In HBase you have machines for both CPU (to serve requests) and storage (to hold the data).
> >
> > If you only grow your cluster for CPU and you keep all RegionServers 100% busy at all times, you are correct.
> >
> > Maybe you need to increase replication.source.size.capacity and/or replication.source.nb.capacity (although I doubt that this will help here).
> >
> I was thinking of giving it a shot, but theoretically it should not have an effect, since I'm not doing anything in parallel, right?
>
> > Also a replication source will pick region servers from the target at random (10% of them by default). That has two effects:
> > 1. Each source will pick exactly one RS at the target: ceil(3*0.1) = 1
> > 2. With such a small cluster setup the likelihood is high that two or more RSs in the source will happen to pick the same RS at the target, thus leading to less throughput.
>
> You are absolutely correct. In Graphite, in the beginning, I saw that only one slave RS was getting all the replicateLogEntries RPC calls. I searched the master RS logs and saw "Choose Peer" as follows:
> Master RS 74: Choose peer 83
> Master RS 75: Choose peer 83
> Master RS 76: Choose peer 85
> For some reason, they ALL talked to 83 (which seems like a bug to me).
>
> I thought I had nailed the bottleneck, so I changed the factor from 0.1 to 1. It had the exact effect you described, and now all RSs were getting the same amount of replicateLogEntries RPC calls, BUT it didn't budge the replication throughput. When I checked the network card usage I understood that even when all 3 RSs were talking to the same slave RS, the network wasn't the bottleneck.
>
> > In fact your numbers might indicate that two of your source RSs might have picked the same target (you get 2/3 of your throughput via replication).
> >
> > In any case, before drawing conclusions this should be tested with a larger cluster.
> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs will round-robin all target RSs and lead to better distribution), but that might have other side-effects, too.
>
> I'll try getting two clusters of 10 RSs each and see if that helps. I suspect it won't. My hunch is that since we're replicating with no more than 10 threads, if I take my client, set it to 10 threads and measure the throughput, this will be the maximum replication throughput. Thus, if my client writes with, let's say, 20 threads (or I have two clients with 10 threads each), then I'm bound to reach an ever-increasing ageOfLastShipped.
>
> > Did you measure the disk IO at each RS at the target? Maybe one of them is mostly idle.
>
> I didn't, but I did run my client directly against the slave cluster and measured a throughput of 18 MB/sec, which is higher than the replication throughput of 11 MB/sec, so I concluded the hard drives couldn't be the bottleneck here.
>
> I was thinking of tweaking HBase a bit for my use case: I always send Puts with new row KVs (never update or delete), so ordering doesn't matter to me. Maybe a flag could enable, per column family, opening multiple threads at the ReplicationSource?
>
> One more question, keeping the one thread in mind here: compression on the replicateLogEntries RPC call shouldn't really help here, right? Since the entire RPC call time is mostly the time it takes to run the HTable.batch call on the slave RS, right? If I enable compression somehow (hack the HBase code to test-drive it), I will only speed up the transfer time of the batch to the slave RS, but still wait on the insertion of this batch into the slave cluster.
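(Illustration, not from the thread itself: a back-of-envelope sketch of that argument, using the default 64 MB block size and the throughput numbers measured elsewhere in this thread; the arithmetic is indicative only.)

    public class CompressionBackOfEnvelope {
      public static void main(String[] args) {
        // Numbers taken from measurements quoted in this thread.
        double batchMB  = 64.0;   // default replication block size
        double linkMBps = 111.0;  // netcat throughput between the clusters
        double sinkMBps = 18.0;   // slave cluster write throughput, measured directly

        double transferSec = batchMB / linkMBps; // ~0.6 s on the wire
        double applySec    = batchMB / sinkMBps; // ~3.6 s inside HTable.batch()

        // Even if compression made the transfer free, the call is dominated by
        // applying the batch, so the best case is roughly a 14% speedup.
        double bestCaseGain = transferSec / (transferSec + applySec);
        System.out.printf("transfer=%.2fs apply=%.2fs max gain=%.0f%%%n",
            transferSec, applySec, bestCaseGain * 100);
      }
    }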
> >
> > -- Lars
> >
> > ________________________________
> > From: Asaf Mesika <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Thursday, June 20, 2013 1:38 PM
> > Subject: Re: Replication not suited for intensive write applications?
> >
> > Thanks for the answer!
> > My responses are inline.
> >
> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <[email protected]> wrote:
> >
> > > First off, this is a pretty constructed case leading to a specious general conclusion.
> > >
> > > If you only have three RSs/DNs and the default replication factor of 3, each machine will get every single write.
> > > That is the first issue. Using HBase makes little sense with such a small cluster.
> >
> > You are correct; nonetheless, the network, as I measured it, was far from its capacity and thus probably not the bottleneck.
> >
> > > Secondly, as you say yourself, there are only three regionservers writing to the replicated cluster, using a single thread each in order to preserve ordering.
> > > With more region servers your scale will tip the other way. Again, more regionservers will make this better.
> >
> > I presume that, in production, I will add more region servers to accommodate the growing write demand on my cluster. Hence, my clients will write with more threads. Thus, proportionally, I will always have a lot more client threads than region servers (each of which has one replication thread). So I don't see how adding more region servers will tip the scale to the other side.
> > The only way to avoid this is to design the cluster so that if x client threads are enough to handle the events being written to HBase, then x is the number of region servers I should have. If I have a spike, it will even out eventually, but this under-utilizes my cluster hardware, no?
> >
> > > As for your other question, more threads can lead to better interleaving of CPU and IO, thus leading to better throughput (this relationship is not linear, though).
> > >
> > > -- Lars
> > >
> > > ----- Original Message -----
> > > From: Asaf Mesika <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Sent: Thursday, June 20, 2013 3:46 AM
> > > Subject: Replication not suited for intensive write applications?
> > >
> > > Hi,
> > >
> > > I've been conducting lots of benchmarks to test the maximum throughput of replication in HBase.
> > >
> > > I've come to the conclusion that HBase replication is not suited for write-intensive applications. I hope that people here can show me where I'm wrong.
> > >
> > > *My setup*
> > > *Cluster* (master and slave are alike):
> > > 1 Master, NameNode
> > > 3 RS, DataNode
> > >
> > > All computers are the same: 8 cores x 3.4 GHz, 8 GB RAM, 1 Gigabit ethernet card.
> > >
> > > I insert data into HBase from a Java process (client) reading files from disk, running on the machine running the HBase Master in the master cluster.
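(Illustration, not from the thread: a hypothetical sketch of this kind of load client against the 0.94 API, with N writer threads batching Puts; the table name, column family, and batch size are placeholders, and the 1500-byte value mirrors the average row size mentioned above.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadClient {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        final int threads = 10; // the thread count varied in the benchmark
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
          final int id = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                HTable table = new HTable(conf, "table1"); // placeholder name
                List<Put> batch = new ArrayList<Put>();
                for (int i = 0; i < 100000; i++) {
                  Put p = new Put(Bytes.toBytes("row-" + id + "-" + i));
                  p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), new byte[1500]);
                  batch.add(p);
                  if (batch.size() == 1000) { // flush in fixed-size batches
                    table.put(batch);
                    batch.clear();
                  }
                }
                table.put(batch); // flush the remainder
                table.close();
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
      }
    }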
> > > *Benchmark Results*
> > > When the client writes with 10 threads, the master cluster writes at 17 MB/sec, while the replicated cluster writes at 12 MB/sec. The data size I wrote is 15 GB, all Puts, to two different tables.
> > > Both clusters, when tested independently without replication, achieved a write throughput of 17-19 MB/sec, so evidently the replication process is the bottleneck.
> > >
> > > I also tested connectivity between the two clusters using "netcat" and achieved 111 MB/sec.
> > > I've checked the usage of the network cards on the client, the master cluster region servers and the slave region servers. No computer went over 30 MB/sec in Receive or Transmit.
> > > The way I checked was rather crude but works: I ran "netstat -ie" before HBase in the master cluster started writing and after it finished. The same was done on the replicated cluster (when the replication started and finished). I can tell the amount of bytes Received and Transmitted, and I know the duration each cluster worked, so I can calculate the throughput.
> > >
> > > *The bottleneck in my opinion*
> > > Since we've excluded network capacity, and each cluster works at a faster rate independently, all that is left is the replication process.
> > > My client writes to the master cluster with 10 threads, and manages to write at 17-18 MB/sec.
> > > Each region server has only 1 thread responsible for transmitting the data written to the WAL to the slave cluster. Thus in my setup I effectively have 3 threads writing to the slave cluster. This is the bottleneck, since the process cannot be parallelized: it must transmit the WAL in a certain order.
> > >
> > > *Conclusion*
> > > When writing intensively to HBase with more than 3 threads (in my setup), you can't use replication.
> > >
> > > *Master throughput without replication*
> > > On a different note, there is one thing I couldn't understand at all.
> > > When I turned off replication and wrote with my client with 3 threads, I got a throughput of 11.3 MB/sec. When I wrote with 10 threads (any more than that doesn't help), I achieved a maximum throughput of 19 MB/sec.
> > > The network cards showed 30 MB/sec Receive and 20 MB/sec Transmit on each RS, so the network capacity was not the bottleneck.
> > > On the HBase master machine, which ran the client, the network card showed a Receive throughput of 0.5 MB/sec and a Transmit throughput of 18.28 MB/sec. Hence it's the client machine's network card creating the bottleneck.
> > >
> > > The only explanation I have is the synchronized writes to the WAL. Those 10 threads have to get in line and, one by one, write their batch of Puts to the WAL, which creates a bottleneck.
> > >
> > > *My question*
> > > The one thing I couldn't understand is: when I write with 3 threads, I have no more than 3 concurrent RPC requests to write in each RS. They achieved 11.3 MB/sec.
> > > The write to the WAL is synchronized, so why did increasing the number of threads to 10 (x3 more) actually increase the throughput to 19 MB/sec? They all get in line to write to the same location, so it seems concurrent writes shouldn't improve throughput at all.
> > >
> > > Thank you!
> > >
> > > Asaf
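(Illustration, not from the thread: the netstat-based estimate described above reduces to arithmetic like the following; the counter values and duration are read off manually, and all names are placeholders.)

    public class NetThroughput {
      public static void main(String[] args) {
        // RX byte counters copied from "netstat -ie" before and after the run.
        long rxBefore = Long.parseLong(args[0]);
        long rxAfter  = Long.parseLong(args[1]);
        long seconds  = Long.parseLong(args[2]); // wall-clock duration of the run

        double mbPerSec = (rxAfter - rxBefore) / (1024.0 * 1024.0) / seconds;
        System.out.printf("Receive throughput: %.1f MB/sec%n", mbPerSec);
      }
    }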
