bq. I'm not sure if it's really a problem tho.

Let's say the maximum throughput achieved by writing with k client threads
is 30 MB/sec, where k = the number of region servers. If you are
consistently writing to HBase at more than 30 MB/sec - let's say 40 MB/sec
with 2k threads - then you can't use HBase replication and must write your
own solution.

One way I've started thinking about this is to somehow declare that for a
specific table the order of Puts is not important (say each write is
unique), so that you can spawn multiple threads for replicating a WAL file.
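To make that idea concrete, here is a minimal, hypothetical sketch of a
sharded shipper for the case where cross-row ordering has been declared
unimportant. Everything except org.apache.hadoop.hbase.client.Row is made
up for illustration (shipBatch() stands in for whatever actually sends a
batch to the slave cluster); hashing on the row key keeps all edits of a
given row in one bucket, so per-row ordering is still preserved:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.hbase.client.Row;

    public class ShardedShipper {
      private final ExecutorService pool;
      private final int numShippers;

      public ShardedShipper(int numShippers) {
        this.numShippers = numShippers;
        this.pool = Executors.newFixedThreadPool(numShippers);
      }

      // Partition edits by row key so each row's edits stay ordered,
      // then ship every bucket on its own thread.
      public void ship(List<Row> edits) throws Exception {
        List<List<Row>> buckets = new ArrayList<List<Row>>();
        for (int i = 0; i < numShippers; i++) {
          buckets.add(new ArrayList<Row>());
        }
        for (Row edit : edits) {
          int b = (Arrays.hashCode(edit.getRow()) & Integer.MAX_VALUE)
              % numShippers;
          buckets.get(b).add(edit);
        }
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (final List<Row> bucket : buckets) {
          futures.add(pool.submit(new Runnable() {
            public void run() {
              shipBatch(bucket); // hypothetical: send one bucket to the sink
            }
          }));
        }
        // Advance the replicated WAL position only after all buckets succeed.
        for (Future<?> f : futures) {
          f.get();
        }
      }

      private void shipBatch(List<Row> bucket) {
        // Placeholder for whatever actually ships edits to the slave cluster.
      }
    }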
On Sat, Jun 22, 2013 at 12:18 AM, Jean-Daniel Cryans <[email protected]> wrote:

> I think that the same way writing with more clients helped throughput,
> writing with only 1 replication thread will hurt it. The clients in
> both cases have to read something (a file from HDFS or the WAL) then
> ship it, meaning that you can utilize the cluster better since a
> single client isn't consistently writing.
>
> I agree with Asaf's assessment that it's possible that you can write
> faster into HBase than you can replicate from it if your clients are
> using the write buffers and have a bigger aggregate throughput than
> replication's.
>
> I'm not sure if it's really a problem tho.
>
> J-D
>
> On Fri, Jun 21, 2013 at 3:05 PM, lars hofhansl <[email protected]> wrote:
> > Hmm... Yes. Was worth a try :) Should've checked, and I even wrote
> > that part of the code.
> >
> > I have no good explanation then, and also no good suggestion about
> > how to improve this.
> >
> > ________________________________
> > From: Asaf Mesika <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Friday, June 21, 2013 5:50 AM
> > Subject: Re: Replication not suited for intensive write applications?
> >
> > On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <[email protected]> wrote:
> >
> >> Another thought...
> >>
> >> I assume you only write to a single table, right? How large are your
> >> rows on average?
> >>
> > I'm writing to 2 tables: the avg row size for the 1st table is 1500
> > bytes, and for the second it is around 800 bytes.
> >
> >> Replication will send 64mb blocks by default (or 25000 edits,
> >> whichever is smaller). The default HTable buffer is 2mb only, so the
> >> slave RS receiving a block of edits (assuming it is a full block) has
> >> to do 32 rounds of splitting the edits per region in order to apply
> >> them.
> >>
> > In ReplicationSink.java (0.94.6) I see that HTable.batch() is used,
> > which writes directly to the RS without buffers:
> >
> >   private void batch(byte[] tableName, List<Row> rows) throws IOException {
> >     if (rows.isEmpty()) {
> >       return;
> >     }
> >     HTableInterface table = null;
> >     try {
> >       table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
> >       table.batch(rows);
> >       this.metrics.appliedOpsRate.inc(rows.size());
> >     } catch (InterruptedException ix) {
> >       throw new IOException(ix);
> >     } finally {
> >       if (table != null) {
> >         table.close();
> >       }
> >     }
> >   }
> >
> >> There is no setting specifically targeted at the buffer size for
> >> replication, but maybe you could increase "hbase.client.write.buffer"
> >> to 64mb (67108864) on the slave cluster and see whether that makes a
> >> difference. If it does we can (1) add a setting to control the
> >> ReplicationSink HTable's buffer size, or (2) just have it match the
> >> replication buffer size "replication.source.size.capacity".
> >>
> >> -- Lars
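For anyone who wants to try the buffer suggestion above:
"hbase.client.write.buffer" is a regular client-side property, normally set
in the slave cluster's hbase-site.xml. A small sketch of the equivalent
buffered-write pattern with the 0.94 client API (the table and column names
are examples only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 64 MB, matching replication's default 64mb shipping block.
        conf.setLong("hbase.client.write.buffer", 67108864L);

        HTable table = new HTable(conf, "t1"); // example table name
        table.setAutoFlush(false);             // buffer puts client-side
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);       // accumulates in the client-side buffer
        table.flushCommits(); // one round trip for the whole buffer
        table.close();
      }
    }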
> >> ________________________________
> >> From: lars hofhansl <[email protected]>
> >> To: "[email protected]" <[email protected]>
> >> Sent: Friday, June 21, 2013 1:48 AM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >> Thanks for checking... Interesting. So talking to 3 RSs as opposed to
> >> only 1 before had no effect on the throughput?
> >>
> >> Would be good to explore this a bit more.
> >> Since our RPC is not streaming, latency will affect throughput. In
> >> this case there is latency while all edits are shipped to the RS in
> >> the slave cluster and then extra latency when applying the edits
> >> there (which are likely not local to that RS). A true streaming API
> >> should be better. If that is the case compression *could* help (but
> >> that is a big if).
> >>
> >> The single thread shipping the edits to the slave should not be an
> >> issue as the edits are actually applied by the slave RS, which will
> >> use multiple threads to apply the edits in the local cluster.
> >>
> >> Also my first reply - upon re-reading it - sounded a bit rough; that
> >> was not intended.
> >>
> >> -- Lars
> >>
> >> ----- Original Message -----
> >> From: Asaf Mesika <[email protected]>
> >> To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> >> Cc:
> >> Sent: Thursday, June 20, 2013 10:16 PM
> >> Subject: Re: Replication not suited for intensive write applications?
> >>
> >> Thanks for taking the time to answer!
> >> My answers are inline.
> >>
> >> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <[email protected]> wrote:
> >>
> >> > I see.
> >> >
> >> > In HBase you have machines for both CPU (to serve requests) and
> >> > storage (to hold the data).
> >> >
> >> > If you only grow your cluster for CPU and you keep all RegionServers
> >> > 100% busy at all times, you are correct.
> >> >
> >> > Maybe you need to increase replication.source.size.capacity and/or
> >> > replication.source.nb.capacity (although I doubt that this will
> >> > help here).
> >> >
> >> I was thinking of giving it a shot, but theoretically it should not
> >> have an effect, since I'm not doing anything in parallel, right?
> >>
> >> > Also a replication source will pick region servers from the target
> >> > at random (10% of them by default). That has two effects:
> >> > 1. Each source will pick exactly one RS at the target: ceil(3*0.1) = 1
> >> > 2. With such a small cluster setup the likelihood is high that two
> >> > or more RSs in the source will happen to pick the same RS at the
> >> > target, thus leading to less throughput.
> >> >
> >> You are absolutely correct. In Graphite, in the beginning, I saw that
> >> only one slave RS was getting all the replicateLogEntries RPC calls.
> >> I searched the master RS logs and saw "Choose Peer" as follows:
> >> Master RS 74: Choose peer 83
> >> Master RS 75: Choose peer 83
> >> Master RS 76: Choose peer 85
> >> For some reason, they ALL talked to 83 (which seems like a bug to me).
> >>
> >> I thought I had nailed the bottleneck, so I changed the factor from
> >> 0.1 to 1. It had the exact effect you described, and now all the
> >> slave RSs were getting the same amount of replicateLogEntries RPC
> >> calls, BUT it didn't budge the replication throughput. When I checked
> >> the network card usage I understood that even when all 3 RSs were
> >> talking to the same slave RS, the network wasn't the bottleneck.
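For readers following along: the sink selection being discussed boils down
to picking ceil(numSlaves * ratio) slave region servers at random. The
sketch below illustrates the idea; it is not the exact ReplicationSource
code, and the names are illustrative:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SinkChooser {
      private final Random random = new Random();

      // With 3 slaves and ratio 0.1: ceil(3 * 0.1) = 1, so every source
      // ships to exactly one randomly chosen slave region server.
      public List<String> chooseSinks(List<String> slaveRegionServers,
          float ratio) {
        int nbSinks = (int) Math.ceil(slaveRegionServers.size() * ratio);
        List<String> shuffled = new ArrayList<String>(slaveRegionServers);
        Collections.shuffle(shuffled, random);
        return shuffled.subList(0, nbSinks);
      }
    }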
> >> > In fact your numbers might indicate that two of your source RSs
> >> > might have picked the same target (you get 2/3 of your throughput
> >> > via replication).
> >> >
> >> > In any case, before drawing conclusions this should be tested with
> >> > a larger cluster.
> >> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source
> >> > RSs will round-robin all target RSs and lead to better
> >> > distribution), but that might have other side-effects, too.
> >> >
> >> I'll try getting two clusters of 10 RS each and see if that helps. I
> >> suspect it won't. My hunch is: since we're replicating with no more
> >> than 10 threads, if I take my client, set it to 10 threads and
> >> measure the throughput, that will be the maximum replication
> >> throughput. Thus, if my client writes with, let's say, 20 threads (or
> >> I have two clients with 10 threads each), then I'm bound to reach an
> >> ever-increasing ageOfLastShipped.
> >>
> >> > Did you measure the disk IO at each RS at the target? Maybe one of
> >> > them is mostly idle.
> >> >
> >> I didn't, but I did run my client directly against the slave cluster
> >> and measured a throughput of 18 MB/sec, which is bigger than the
> >> replication throughput of 11 MB/sec, so I concluded the hard drives
> >> couldn't be the bottleneck here.
> >>
> >> I was thinking of tweaking HBase a bit for my use case: I always send
> >> Puts with new row KVs (never update or delete), so ordering is not
> >> important to me. Maybe enable, with a flag on a certain column
> >> family, the ability to open multiple threads at the ReplicationSource?
> >>
> >> One more question - keeping the one thread in mind here, compression
> >> on the replicateLogEntries RPC call shouldn't really help here,
> >> right? Since the entire RPC call time is mostly the time it takes to
> >> run the HTable.batch call on the slave RS, right? If I enable
> >> compression somehow (hack the HBase code to test-drive it), I will
> >> only speed up the transfer time of the batch to the slave RS, but
> >> still wait on the insertion of this batch into the slave cluster.
> >>
> >> > -- Lars
> >> > ________________________________
> >> > From: Asaf Mesika <[email protected]>
> >> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> >> > Sent: Thursday, June 20, 2013 1:38 PM
> >> > Subject: Re: Replication not suited for intensive write applications?
> >> >
> >> > Thanks for the answer!
> >> > My responses are inline.
> >> >
> >> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <[email protected]> wrote:
> >> >
> >> > > First off, this is a pretty constructed case leading to a
> >> > > specious general conclusion.
> >> > >
> >> > > If you only have three RSs/DNs and the default replication factor
> >> > > of 3, each machine will get every single write.
> >> > > That is the first issue. Using HBase makes little sense with such
> >> > > a small cluster.
> >> > >
> >> > You are correct; nonetheless, the network, as I measured, was far
> >> > from its capacity and thus probably not the bottleneck.
> >> >
> >> > > Secondly, as you say yourself, there are only three region
> >> > > servers writing to the replicated cluster, using a single thread
> >> > > each in order to preserve ordering.
> >> > > With more region servers your scale will tip the other way.
> >> > > Again, more region servers will make this better.
> >> > >
> >> > I presume, in production, I will add more region servers to
> >> > accommodate the growing write demand on my cluster. Hence, my
> >> > clients will write with more threads.
> >> > Thus proportionally I will always have a lot more client threads
> >> > than region servers (each of which has one replication thread), so
> >> > I don't see how adding more region servers will tip the scale to
> >> > the other side. The only way to avoid this is to design the cluster
> >> > so that the number of region servers matches the number of threads
> >> > (x) my client needs in order to handle the events it receives and
> >> > write them to HBase. If I have a spike, it will even out
> >> > eventually, but this under-utilizes my cluster hardware, no?
> >> >
> >> > > As for your other question, more threads can lead to better
> >> > > interleaving of CPU and IO, thus leading to better throughput
> >> > > (this relationship is not linear, though).
> >> > >
> >> > > -- Lars
> >> > >
> >> > > ----- Original Message -----
> >> > > From: Asaf Mesika <[email protected]>
> >> > > To: "[email protected]" <[email protected]>
> >> > > Cc:
> >> > > Sent: Thursday, June 20, 2013 3:46 AM
> >> > > Subject: Replication not suited for intensive write applications?
> >> > >
> >> > > Hi,
> >> > >
> >> > > I've been conducting lots of benchmarks to test the maximum
> >> > > throughput of replication in HBase.
> >> > >
> >> > > I've come to the conclusion that HBase replication is not suited
> >> > > for write intensive applications. I hope that people here can
> >> > > show me where I'm wrong.
> >> > >
> >> > > *My setup*
> >> > > *Cluster* (Master and slave are alike)
> >> > > 1 Master, NameNode
> >> > > 3 RS, DataNode
> >> > >
> >> > > All computers are the same: 8 cores x 3.4 GHz, 8 GB RAM, 1 Gigabit
> >> > > ethernet card
> >> > >
> >> > > I insert data into HBase from a Java process (client) reading
> >> > > files from disk, running on the machine running the HBase Master
> >> > > in the master cluster.
> >> > >
> >> > > *Benchmark Results*
> >> > > When the client writes with 10 threads, the master cluster writes
> >> > > at 17 MB/sec, while the replicated cluster writes at 12 MB/sec.
> >> > > The data size I wrote is 15 GB, all Puts, to two different tables.
> >> > > Both clusters, when tested independently without replication,
> >> > > achieved a write throughput of 17-19 MB/sec, so evidently the
> >> > > replication process is the bottleneck.
> >> > >
> >> > > I also tested connectivity between the two clusters using
> >> > > "netcat" and achieved 111 MB/sec.
> >> > > I've checked the usage of the network cards on the client, the
> >> > > master cluster region servers and the slave region servers. No
> >> > > computer went over 30 MB/sec in Receive or Transmit.
> >> > > The way I checked was rather crude but works: I ran "netstat -ie"
> >> > > before HBase in the master cluster starts writing and after it
> >> > > finishes. The same was done on the replicated cluster (when the
> >> > > replication started and finished). I can tell the amount of bytes
> >> > > Received and Transmitted, and I know the duration each cluster
> >> > > worked, thus I can calculate the throughput.
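The arithmetic behind that netstat measurement is just the byte-counter
delta divided by the elapsed time; a trivial sketch with made-up sample
numbers (the counters would come from the two "netstat -ie" readings):

    public class ThroughputFromNetstat {
      public static void main(String[] args) {
        // Hypothetical RX byte counters read before and after the run.
        long rxBefore = 1000000000L;
        long rxAfter = 28000000000L;
        double seconds = 1500.0; // duration of the run
        double mbPerSec = (rxAfter - rxBefore) / (1024.0 * 1024.0) / seconds;
        System.out.printf("Receive throughput: %.1f MB/sec%n", mbPerSec);
      }
    }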
> >> > > *The bottleneck in my opinion*
> >> > > Since we've excluded network capacity, and each cluster works at
> >> > > a faster rate independently, all that is left is the replication
> >> > > process.
> >> > > My client writes to the master cluster with 10 threads, and
> >> > > manages to write at 17-18 MB/sec.
> >> > > Each region server has only 1 thread responsible for transmitting
> >> > > the data written to the WAL to the slave cluster. Thus in my
> >> > > setup I effectively have 3 threads writing to the slave cluster.
> >> > > This is the bottleneck, since the process cannot be parallelized:
> >> > > it must transmit the WAL in a certain order.
> >> > >
> >> > > *Conclusion*
> >> > > When writing intensively to HBase with more than 3 threads (in my
> >> > > setup), you can't use replication.
> >> > >
> >> > > *Master throughput without replication*
> >> > > On a different note, there is one thing I couldn't understand at
> >> > > all.
> >> > > When I turned off replication and wrote with my client with 3
> >> > > threads, I got a throughput of 11.3 MB/sec. When I wrote with 10
> >> > > threads (any more than that doesn't help) I achieved a maximum
> >> > > throughput of 19 MB/sec.
> >> > > The network cards showed 30 MB/sec Receive and 20 MB/sec Transmit
> >> > > on each RS, thus the network capacity was not the bottleneck.
> >> > > On the HBase master machine, which ran the client, the network
> >> > > card showed a Receive throughput of 0.5 MB/sec and a Transmit
> >> > > throughput of 18.28 MB/sec. Hence it's the client machine's
> >> > > network card creating the bottleneck.
> >> > >
> >> > > The only explanation I have is the synchronized writes to the
> >> > > WAL. Those 10 threads have to get in line and, one by one, write
> >> > > their batch of Puts to the WAL, which creates a bottleneck.
> >> > >
> >> > > *My question*:
> >> > > The one thing I couldn't understand is: when I write with 3
> >> > > threads, meaning I have no more than 3 concurrent RPC requests to
> >> > > write in each RS, they achieve 11.3 MB/sec.
> >> > > The write to the WAL is synchronized, so why does increasing the
> >> > > number of threads to 10 (x3 more) actually increase the
> >> > > throughput to 19 MB/sec?
> >> > > They all get in line to write to the same location, so it seems
> >> > > concurrent writes shouldn't improve throughput at all.
> >> > >
> >> > > Thank you!
> >> > >
> >> > > Asaf
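One plausible answer to that last question - not confirmed anywhere in this
thread - is WAL group commit: writers append under a lock, but a single
sync can cover the appends of every thread that queued up in the meantime,
so the per-sync cost is amortized across more concurrent writers. A minimal
sketch of the mechanism (illustrative only, not HBase's actual HLog code):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Each writer appends, then asks for a sync up to its own sequence
    // number. The first thread into sync() flushes everything pending, so
    // later threads often find their edits already durable and return
    // immediately - one fsync pays for many appends.
    public class GroupCommitLog {
      private final List<byte[]> pending = new ArrayList<byte[]>();
      private long appendedSeq = 0;
      private long syncedSeq = 0;

      public synchronized long append(byte[] edit) {
        pending.add(edit);
        return ++appendedSeq;
      }

      public synchronized void sync(long seq) throws IOException {
        if (seq <= syncedSeq) {
          return; // an earlier caller's flush already covered this edit
        }
        flushAndFsync(pending); // placeholder for the actual disk write
        syncedSeq = appendedSeq;
        pending.clear();
      }

      private void flushAndFsync(List<byte[]> edits) throws IOException {
        // Write `edits` to the log file and fsync once.
      }
    }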
