bq. I'm not sure if it's really a problem tho.

Let's say the maximum throughput achieved by writing with k client threads
is 30 MB/sec, where k = the number of region servers. If you are
consistently writing to HBase at more than 30 MB/sec - let's say 40 MB/sec
with 2k threads - then you can't use HBase replication and must write your
own solution.

One way I've started thinking about this is to somehow declare that for a
specific table the order of Puts is not important (say each write is
unique), so that you can spawn multiple threads for replicating a WAL file.
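To make that idea concrete, here is a minimal, hypothetical sketch of a
sharded shipper for the case where cross-row ordering has been declared
unimportant. Everything except org.apache.hadoop.hbase.client.Row is made
up for illustration (shipBatch() stands in for whatever actually sends a
batch to the slave cluster); hashing on the row key keeps all edits of a
given row in one bucket, so per-row ordering is still preserved:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.hbase.client.Row;

    public class ShardedShipper {
      private final ExecutorService pool;
      private final int numShippers;

      public ShardedShipper(int numShippers) {
        this.numShippers = numShippers;
        this.pool = Executors.newFixedThreadPool(numShippers);
      }

      // Partition edits by row key so each row's edits stay ordered,
      // then ship every bucket on its own thread.
      public void ship(List<Row> edits) throws Exception {
        List<List<Row>> buckets = new ArrayList<List<Row>>();
        for (int i = 0; i < numShippers; i++) {
          buckets.add(new ArrayList<Row>());
        }
        for (Row edit : edits) {
          int b = (Arrays.hashCode(edit.getRow()) & Integer.MAX_VALUE)
              % numShippers;
          buckets.get(b).add(edit);
        }
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (final List<Row> bucket : buckets) {
          futures.add(pool.submit(new Runnable() {
            public void run() {
              shipBatch(bucket); // hypothetical: send one bucket to the sink
            }
          }));
        }
        // Advance the replicated WAL position only after all buckets succeed.
        for (Future<?> f : futures) {
          f.get();
        }
      }

      private void shipBatch(List<Row> bucket) {
        // Placeholder for whatever actually ships edits to the slave cluster.
      }
    }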
On Sat, Jun 22, 2013 at 12:18 AM, Jean-Daniel Cryans <[email protected]> wrote:

> I think that the same way writing with more clients helped throughput,
> writing with only 1 replication thread will hurt it. The clients in
> both cases have to read something (a file from HDFS or the WAL) then
> ship it, meaning that you can utilize the cluster better since a
> single client isn't consistently writing.
>
> I agree with Asaf's assessment that it's possible that you can write
> faster into HBase than you can replicate from it if your clients are
> using the write buffers and have a bigger aggregate throughput than
> replication's.
>
> I'm not sure if it's really a problem tho.
>
> J-D
>
> On Fri, Jun 21, 2013 at 3:05 PM, lars hofhansl <[email protected]> wrote:
> > Hmm... Yes. Was worth a try :) Should've checked, and I even wrote
> > that part of the code.
> >
> > I have no good explanation then, and also no good suggestion about
> > how to improve this.
> >
> > ________________________________
> > From: Asaf Mesika <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Friday, June 21, 2013 5:50 AM
> > Subject: Re: Replication not suited for intensive write applications?
> >
> > On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <[email protected]> wrote:
> >
> >> Another thought...
> >>
> >> I assume you only write to a single table, right? How large are your
> >> rows on average?
> >>
> > I'm writing to 2 tables: the avg row size for the 1st table is 1500
> > bytes, and for the second it is around 800 bytes.
> >
> >> Replication will send 64mb blocks by default (or 25000 edits,
> >> whichever is smaller). The default HTable buffer is 2mb only, so the
> >> slave RS receiving a block of edits (assuming it is a full block) has
> >> to do 32 rounds of splitting the edits per region in order to apply
> >> them.
> >>
> > In ReplicationSink.java (0.94.6) I see that HTable.batch() is used,
> > which writes directly to the RS without buffers:
> >
> >   private void batch(byte[] tableName, List<Row> rows) throws IOException {
> >     if (rows.isEmpty()) {
> >       return;
> >     }
> >     HTableInterface table = null;
> >     try {
> >       table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
> >       table.batch(rows);
> >       this.metrics.appliedOpsRate.inc(rows.size());
> >     } catch (InterruptedException ix) {
> >       throw new IOException(ix);
> >     } finally {
> >       if (table != null) {
> >         table.close();
> >       }
> >     }
> >   }
> >
> >> There is no setting specifically targeted at the buffer size for
> >> replication, but maybe you could increase "hbase.client.write.buffer"
> >> to 64mb (67108864) on the slave cluster and see whether that makes a
> >> difference. If it does we can (1) add a setting to control the
> >> ReplicationSink HTable's buffer size, or (2) just have it match the
> >> replication buffer size "replication.source.size.capacity".
> >>
> >> -- Lars
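For anyone who wants to try the buffer suggestion above:
"hbase.client.write.buffer" is a regular client-side property, normally set
in the slave cluster's hbase-site.xml. A small sketch of the equivalent
buffered-write pattern with the 0.94 client API (the table and column names
are examples only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 64 MB, matching replication's default 64mb shipping block.
        conf.setLong("hbase.client.write.buffer", 67108864L);

        HTable table = new HTable(conf, "t1"); // example table name
        table.setAutoFlush(false);             // buffer puts client-side
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);       // accumulates in the client-side buffer
        table.flushCommits(); // one round trip for the whole buffer
        table.close();
      }
    }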
> >> ________________________________
> >> From: lars hofhansl <[email protected]>
> >> To: "[email protected]" <[email protected]>
> >> Sent: Friday, June 21, 2013 1:48 AM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >> Thanks for checking... Interesting. So talking to 3 RSs as opposed to
> >> only 1 before had no effect on the throughput?
> >>
> >> Would be good to explore this a bit more.
> >> Since our RPC is not streaming, latency will affect throughput. In
> >> this case there is latency while all edits are shipped to the RS in
> >> the slave cluster and then extra latency when applying the edits
> >> there (which are likely not local to that RS). A true streaming API
> >> should be better. If that is the case compression *could* help (but
> >> that is a big if).
> >>
> >> The single thread shipping the edits to the slave should not be an
> >> issue as the edits are actually applied by the slave RS, which will
> >> use multiple threads to apply the edits in the local cluster.
> >>
> >> Also my first reply - upon re-reading it - sounded a bit rough; that
> >> was not intended.
> >>
> >> -- Lars
> >>
> >> ----- Original Message -----
> >> From: Asaf Mesika <[email protected]>
> >> To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> >> Cc:
> >> Sent: Thursday, June 20, 2013 10:16 PM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >> Thanks for taking the time to answer!
> >> My answers are inline.
> >>
> >> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <[email protected]> wrote:
> >>
> >> > I see.
> >> >
> >> > In HBase you have machines for both CPU (to serve requests) and
> >> > storage (to hold the data).
> >> >
> >> > If you only grow your cluster for CPU and you keep all RegionServers
> >> > 100% busy at all times, you are correct.
> >> >
> >> > Maybe you need to increase replication.source.size.capacity and/or
> >> > replication.source.nb.capacity (although I doubt that this will
> >> > help here).
> >> >
> >> I was thinking of giving it a shot, but theoretically it should not
> >> have an effect, since I'm not doing anything in parallel, right?
> >>
> >> > Also a replication source will pick region servers from the target
> >> > at random (10% of them by default). That has two effects:
> >> > 1. Each source will pick exactly one RS at the target: ceil(3*0.1) = 1
> >> > 2. With such a small cluster setup the likelihood is high that two
> >> > or more RSs in the source will happen to pick the same RS at the
> >> > target, thus leading to less throughput.
> >> >
> >> You are absolutely correct. In Graphite, in the beginning, I saw that
> >> only one slave RS was getting all the replicateLogEntries RPC calls.
> >> I searched the master RS logs and saw "Choose Peer" as follows:
> >> Master RS 74: Choose peer 83
> >> Master RS 75: Choose peer 83
> >> Master RS 76: Choose peer 85
> >> For some reason, they ALL talked to 83 (which seems like a bug to me).
> >>
> >> I thought I had nailed the bottleneck, so I changed the factor from
> >> 0.1 to 1. It had the exact effect you described, and now all the
> >> slave RSs were getting the same amount of replicateLogEntries RPC
> >> calls, BUT it didn't budge the replication throughput. When I checked
> >> the network card usage I understood that even when all 3 RSs were
> >> talking to the same slave RS, the network wasn't the bottleneck.
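For readers following along: the sink selection being discussed boils down
to picking ceil(numSlaves * ratio) slave region servers at random. The
sketch below illustrates the idea; it is not the exact ReplicationSource
code, and the names are illustrative:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SinkChooser {
      private final Random random = new Random();

      // With 3 slaves and ratio 0.1: ceil(3 * 0.1) = 1, so every source
      // ships to exactly one randomly chosen slave region server.
      public List<String> chooseSinks(List<String> slaveRegionServers,
          float ratio) {
        int nbSinks = (int) Math.ceil(slaveRegionServers.size() * ratio);
        List<String> shuffled = new ArrayList<String>(slaveRegionServers);
        Collections.shuffle(shuffled, random);
        return shuffled.subList(0, nbSinks);
      }
    }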
> >> > In fact your numbers might indicate that two of your source RSs
> >> > might have picked the same target (you get 2/3 of your throughput
> >> > via replication).
> >> >
> >> > In any case, before drawing conclusions this should be tested with
> >> > a larger cluster.
> >> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source
> >> > RSs will round-robin all target RSs and lead to better
> >> > distribution), but that might have other side-effects, too.
> >> >
> >> I'll try getting two clusters of 10 RS each and see if that helps. I
> >> suspect it won't. My hunch is: since we're replicating with no more
> >> than 10 threads, if I take my client, set it to 10 threads and
> >> measure the throughput, that will be the maximum replication
> >> throughput. Thus, if my client writes with, let's say, 20 threads (or
> >> I have two clients with 10 threads each), then I'm bound to reach an
> >> ever-increasing ageOfLastShipped.
> >>
> >> > Did you measure the disk IO at each RS at the target? Maybe one of
> >> > them is mostly idle.
> >> >
> >> I didn't, but I did run my client directly against the slave cluster
> >> and measured a throughput of 18 MB/sec, which is bigger than the
> >> replication throughput of 11 MB/sec, so I concluded the hard drives
> >> couldn't be the bottleneck here.
> >>
> >> I was thinking of tweaking HBase a bit for my use case: I always send
> >> Puts with new row KVs (never update or delete), so ordering is not
> >> important to me. Maybe enable, with a flag on a certain column
> >> family, the ability to open multiple threads at the ReplicationSource?
> >>
> >> One more question - keeping the one thread in mind here, compression
> >> on the replicateLogEntries RPC call shouldn't really help here,
> >> right? Since the entire RPC call time is mostly the time it takes to
> >> run the HTable.batch call on the slave RS, right? If I enable
> >> compression somehow (hack the HBase code to test-drive it), I will
> >> only speed up the transfer time of the batch to the slave RS, but
> >> still wait on the insertion of this batch into the slave cluster.
> >>
> >> > -- Lars
> >> > ________________________________
> >> > From: Asaf Mesika <[email protected]>
> >> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> >> > Sent: Thursday, June 20, 2013 1:38 PM
> >> > Subject: Re: Replication not suited for intensive write applications?
> >> >
> >> > Thanks for the answer!
> >> > My responses are inline.
> >> >
> >> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <[email protected]> wrote:
> >> >
> >> > > First off, this is a pretty constructed case leading to a
> >> > > specious general conclusion.
> >> > >
> >> > > If you only have three RSs/DNs and the default replication factor
> >> > > of 3, each machine will get every single write.
> >> > > That is the first issue. Using HBase makes little sense with such
> >> > > a small cluster.
> >> > >
> >> > You are correct; nonetheless, the network, as I measured, was far
> >> > from its capacity and thus probably not the bottleneck.
> >> >
> >> > > Secondly, as you say yourself, there are only three region
> >> > > servers writing to the replicated cluster, using a single thread
> >> > > each in order to preserve ordering.
> >> > > With more region servers your scale will tip the other way.
> >> > > Again, more region servers will make this better.
> >> > >
> >> > I presume, in production, I will add more region servers to
> >> > accommodate the growing write demand on my cluster. Hence, my
> >> > clients will write with more threads.
> >> > Thus proportionally I will always have a lot more client threads
> >> > than region servers (each of which has one replication thread), so
> >> > I don't see how adding more region servers will tip the scale to
> >> > the other side. The only way to avoid this is to design the cluster
> >> > so that the number of region servers matches the number of threads
> >> > (x) my client needs in order to handle the events it receives and
> >> > write them to HBase. If I have a spike, it will even out
> >> > eventually, but this under-utilizes my cluster hardware, no?
> >> >
> >> > > As for your other question, more threads can lead to better
> >> > > interleaving of CPU and IO, thus leading to better throughput
> >> > > (this relationship is not linear, though).
> >> > >
> >> > > -- Lars
> >> > >
> >> > > ----- Original Message -----
> >> > > From: Asaf Mesika <[email protected]>
> >> > > To: "[email protected]" <[email protected]>
> >> > > Cc:
> >> > > Sent: Thursday, June 20, 2013 3:46 AM
> >> > > Subject: Replication not suited for intensive write applications?
> >> > >
> >> > > Hi,
> >> > >
> >> > > I've been conducting lots of benchmarks to test the maximum
> >> > > throughput of replication in HBase.
> >> > >
> >> > > I've come to the conclusion that HBase replication is not suited
> >> > > for write intensive applications. I hope that people here can
> >> > > show me where I'm wrong.
> >> > >
> >> > > *My setup*
> >> > > *Cluster* (Master and slave are alike)
> >> > > 1 Master, NameNode
> >> > > 3 RS, DataNode
> >> > >
> >> > > All computers are the same: 8 cores x 3.4 GHz, 8 GB RAM, 1 Gigabit
> >> > > ethernet card
> >> > >
> >> > > I insert data into HBase from a Java process (client) reading
> >> > > files from disk, running on the machine running the HBase Master
> >> > > in the master cluster.
> >> > >
> >> > > *Benchmark Results*
> >> > > When the client writes with 10 threads, the master cluster writes
> >> > > at 17 MB/sec, while the replicated cluster writes at 12 MB/sec.
> >> > > The data size I wrote is 15 GB, all Puts, to two different tables.
> >> > > Both clusters, when tested independently without replication,
> >> > > achieved a write throughput of 17-19 MB/sec, so evidently the
> >> > > replication process is the bottleneck.
> >> > >
> >> > > I also tested connectivity between the two clusters using
> >> > > "netcat" and achieved 111 MB/sec.
> >> > > I've checked the usage of the network cards on the client, the
> >> > > master cluster region servers and the slave region servers. No
> >> > > computer went over 30 MB/sec in Receive or Transmit.
> >> > > The way I checked was rather crude but works: I ran "netstat -ie"
> >> > > before HBase in the master cluster starts writing and after it
> >> > > finishes. The same was done on the replicated cluster (when the
> >> > > replication started and finished). I can tell the amount of bytes
> >> > > Received and Transmitted, and I know the duration each cluster
> >> > > worked, thus I can calculate the throughput.
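The arithmetic behind that netstat measurement is just the byte-counter
delta divided by the elapsed time; a trivial sketch with made-up sample
numbers (the counters would come from the two "netstat -ie" readings):

    public class ThroughputFromNetstat {
      public static void main(String[] args) {
        // Hypothetical RX byte counters read before and after the run.
        long rxBefore = 1000000000L;
        long rxAfter = 28000000000L;
        double seconds = 1500.0; // duration of the run
        double mbPerSec = (rxAfter - rxBefore) / (1024.0 * 1024.0) / seconds;
        System.out.printf("Receive throughput: %.1f MB/sec%n", mbPerSec);
      }
    }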
> >> > > *The bottleneck in my opinion*
> >> > > Since we've excluded network capacity, and each cluster works at
> >> > > a faster rate independently, all that is left is the replication
> >> > > process.
> >> > > My client writes to the master cluster with 10 threads, and
> >> > > manages to write at 17-18 MB/sec.
> >> > > Each region server has only 1 thread responsible for transmitting
> >> > > the data written to the WAL to the slave cluster. Thus in my
> >> > > setup I effectively have 3 threads writing to the slave cluster.
> >> > > This is the bottleneck, since the process cannot be parallelized:
> >> > > it must transmit the WAL in a certain order.
> >> > >
> >> > > *Conclusion*
> >> > > When writing intensively to HBase with more than 3 threads (in my
> >> > > setup), you can't use replication.
> >> > >
> >> > > *Master throughput without replication*
> >> > > On a different note, there is one thing I couldn't understand at
> >> > > all.
> >> > > When I turned off replication and wrote with my client with 3
> >> > > threads, I got a throughput of 11.3 MB/sec. When I wrote with 10
> >> > > threads (any more than that doesn't help) I achieved a maximum
> >> > > throughput of 19 MB/sec.
> >> > > The network cards showed 30 MB/sec Receive and 20 MB/sec Transmit
> >> > > on each RS, thus the network capacity was not the bottleneck.
> >> > > On the HBase master machine, which ran the client, the network
> >> > > card showed a Receive throughput of 0.5 MB/sec and a Transmit
> >> > > throughput of 18.28 MB/sec. Hence it's the client machine's
> >> > > network card creating the bottleneck.
> >> > >
> >> > > The only explanation I have is the synchronized writes to the
> >> > > WAL. Those 10 threads have to get in line and, one by one, write
> >> > > their batch of Puts to the WAL, which creates a bottleneck.
> >> > >
> >> > > *My question*:
> >> > > The one thing I couldn't understand is: when I write with 3
> >> > > threads, meaning I have no more than 3 concurrent RPC requests to
> >> > > write in each RS, they achieve 11.3 MB/sec.
> >> > > The write to the WAL is synchronized, so why does increasing the
> >> > > number of threads to 10 (x3 more) actually increase the
> >> > > throughput to 19 MB/sec?
> >> > > They all get in line to write to the same location, so it seems
> >> > > concurrent writes shouldn't improve throughput at all.
> >> > >
> >> > > Thank you!
> >> > >
> >> > > Asaf
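One plausible answer to that last question - not confirmed anywhere in this
thread - is WAL group commit: writers append under a lock, but a single
sync can cover the appends of every thread that queued up in the meantime,
so the per-sync cost is amortized across more concurrent writers. A minimal
sketch of the mechanism (illustrative only, not HBase's actual HLog code):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Each writer appends, then asks for a sync up to its own sequence
    // number. The first thread into sync() flushes everything pending, so
    // later threads often find their edits already durable and return
    // immediately - one fsync pays for many appends.
    public class GroupCommitLog {
      private final List<byte[]> pending = new ArrayList<byte[]>();
      private long appendedSeq = 0;
      private long syncedSeq = 0;

      public synchronized long append(byte[] edit) {
        pending.add(edit);
        return ++appendedSeq;
      }

      public synchronized void sync(long seq) throws IOException {
        if (seq <= syncedSeq) {
          return; // an earlier caller's flush already covered this edit
        }
        flushAndFsync(pending); // placeholder for the actual disk write
        syncedSeq = appendedSeq;
        pending.clear();
      }

      private void flushAndFsync(List<byte[]> edits) throws IOException {
        // Write `edits` to the log file and fsync once.
      }
    }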
