Hmm... Yes. Was worth a try :) Should've checked and I even wrote that part of the code.
I have no good explanation then, and also no good suggestion about how to improve this.

________________________________
From: Asaf Mesika <[email protected]>
To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
Sent: Friday, June 21, 2013 5:50 AM
Subject: Re: Replication not suited for intensive write applications?

On Fri, Jun 21, 2013 at 2:38 PM, lars hofhansl <[email protected]> wrote:

> Another thought...
>
> I assume you only write to a single table, right? How large are your rows on average?
>
I'm writing to 2 tables. Avg row size for the 1st table is 1500 bytes, and the second is around 800 bytes.

> Replication will send 64mb blocks by default (or 25000 edits, whatever is smaller). The default HTable buffer is 2mb only, so the slave RS receiving a block of edits (assuming it is a full block) has to do 32 rounds of splitting the edits per region in order to apply them.
>
In ReplicationSink.java (0.94.6) I see that HTable.batch() is used, which writes directly to the RS without buffering?

    private void batch(byte[] tableName, List<Row> rows) throws IOException {
      if (rows.isEmpty()) {
        return;
      }
      HTableInterface table = null;
      try {
        table = new HTable(tableName, this.sharedHtableCon, this.sharedThreadPool);
        table.batch(rows);
        this.metrics.appliedOpsRate.inc(rows.size());
      } catch (InterruptedException ix) {
        throw new IOException(ix);
      } finally {
        if (table != null) {
          table.close();
        }
      }
    }

> There is no setting specifically targeted at the buffer size for replication, but maybe you could increase "hbase.client.write.buffer" to 64mb (67108864) on the slave cluster and see whether that makes a difference. If it does we can (1) add a setting to control the ReplicationSink HTable's buffer size, or (2) just have it match the replication buffer size "replication.source.size.capacity".
>
> -- Lars
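(Illustration, not part of the original thread: a minimal sketch of the experiment suggested above, assuming the 0.94 client API; the table, column family, and row names are placeholders. Note that, as observed above, HTable.batch() bypasses this buffer, so the setting would only matter if the sink used buffered puts.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteBufferExperiment {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // One client buffer per full 64 MB replication block (default is 2 MB).
        conf.setLong("hbase.client.write.buffer", 67108864L);

        HTable table = new HTable(conf, "test_table"); // placeholder table name
        table.setAutoFlush(false); // accumulate Puts client-side
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);       // buffered until the write buffer fills
        table.flushCommits(); // or flush explicitly
        table.close();
      }
    }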
> ________________________________
> From: lars hofhansl <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, June 21, 2013 1:48 AM
> Subject: Re: Replication not suited for intensive write applications?
>
> Thanks for checking... Interesting. So talking to 3 RSs as opposed to only 1 before had no effect on the throughput?
>
> Would be good to explore this a bit more.
> Since our RPC is not streaming, latency will affect throughput. In this case there is latency while all edits are shipped to the RS in the slave cluster and then extra latency when applying the edits there (which are likely not local to that RS). A true streaming API should be better. If that is the case compression *could* help (but that is a big if).
>
> The single thread shipping the edits to the slave should not be an issue as the edits are actually applied by the slave RS, which will use multiple threads to apply the edits in the local cluster.
>
> Also my first reply - upon re-reading it - sounded a bit rough, that was not intended.
>
> -- Lars
>
> ----- Original Message -----
> From: Asaf Mesika <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> Cc:
> Sent: Thursday, June 20, 2013 10:16 PM
> Subject: Re: Replication not suited for intensive write applications?
>
> Thanks for taking the time to answer!
> My answers are inline.
>
> On Fri, Jun 21, 2013 at 1:47 AM, lars hofhansl <[email protected]> wrote:
>
> > I see.
> >
> > In HBase you have machines for both CPU (to serve requests) and storage (to hold the data).
> >
> > If you only grow your cluster for CPU and you keep all RegionServers 100% busy at all times, you are correct.
> >
> > Maybe you need to increase replication.source.size.capacity and/or replication.source.nb.capacity (although I doubt that this will help here).
> >
> I was thinking of giving it a shot, but theoretically it should not have an effect, since I'm not doing anything in parallel, right?
>
> > Also a replication source will pick region servers from the target at random (10% of them by default). That has two effects:
> > 1. Each source will pick exactly one RS at the target: ceil(3*0.1) = 1
> > 2. With such a small cluster setup the likelihood is high that two or more RSs in the source will happen to pick the same RS at the target, thus leading to less throughput.
>
> You are absolutely correct. In Graphite, in the beginning, I saw that only one slave RS was getting all the replicateLogEntries RPC calls. I searched the master RS logs and saw "Choose Peer" as follows:
> Master RS 74: Choose peer 83
> Master RS 75: Choose peer 83
> Master RS 76: Choose peer 85
> For some reason, they ALL talked to 83 (which seems like a bug to me).
>
> I thought I had nailed the bottleneck, so I changed the factor from 0.1 to 1. It had the exact effect you described, and now all RSs were getting the same amount of replicateLogEntries RPC calls, BUT it didn't budge the replication throughput. When I checked the network card usage I understood that even when all 3 RSs were talking to the same slave RS, the network wasn't the bottleneck.
>
> > In fact your numbers might indicate that two of your source RSs might have picked the same target (you get 2/3 of your throughput via replication).
> >
> > In any case, before drawing conclusions this should be tested with a larger cluster.
> > Maybe set replication.source.ratio from 0.1 to 1 (thus the source RSs will round-robin all target RSs and lead to better distribution), but that might have other side-effects, too.
>
> I'll try getting two clusters of 10 RSs each and see if that helps. I suspect it won't. My hunch is that since we're replicating with no more than 10 threads, if I take my client, set it to 10 threads and measure the throughput, this will be the maximum replication throughput. Thus, if my client writes with, let's say, 20 threads (or I have two clients with 10 threads each), then I'm bound to reach an ever-increasing ageOfLastShipped.
>
> > Did you measure the disk IO at each RS at the target? Maybe one of them is mostly idle.
>
> I didn't, but I did run my client directly against the slave cluster and measured a throughput of 18 MB/sec, which is higher than the replication throughput of 11 MB/sec, so I concluded the hard drives couldn't be the bottleneck here.
>
> I was thinking of tweaking HBase a bit for my use case: I always send Puts with new row KVs (never update or delete), so ordering doesn't matter to me. Maybe a flag could enable, per column family, opening multiple threads at the ReplicationSource?
>
> One more question, keeping the one thread in mind here: compression on the replicateLogEntries RPC call shouldn't really help here, right? Since the entire RPC call time is mostly the time it takes to run the HTable.batch call on the slave RS, right? If I enable compression somehow (hack the HBase code to test-drive it), I will only speed up the transfer time of the batch to the slave RS, but still wait on the insertion of this batch into the slave cluster.
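(Illustration, not from the thread itself: a back-of-envelope sketch of that argument, using the default 64 MB block size and the throughput numbers measured elsewhere in this thread; the arithmetic is indicative only.)

    public class CompressionBackOfEnvelope {
      public static void main(String[] args) {
        // Numbers taken from measurements quoted in this thread.
        double batchMB  = 64.0;   // default replication block size
        double linkMBps = 111.0;  // netcat throughput between the clusters
        double sinkMBps = 18.0;   // slave cluster write throughput, measured directly

        double transferSec = batchMB / linkMBps; // ~0.6 s on the wire
        double applySec    = batchMB / sinkMBps; // ~3.6 s inside HTable.batch()

        // Even if compression made the transfer free, the call is dominated by
        // applying the batch, so the best case is roughly a 14% speedup.
        double bestCaseGain = transferSec / (transferSec + applySec);
        System.out.printf("transfer=%.2fs apply=%.2fs max gain=%.0f%%%n",
            transferSec, applySec, bestCaseGain * 100);
      }
    }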
> >
> > -- Lars
> >
> > ________________________________
> > From: Asaf Mesika <[email protected]>
> > To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
> > Sent: Thursday, June 20, 2013 1:38 PM
> > Subject: Re: Replication not suited for intensive write applications?
> >
> > Thanks for the answer!
> > My responses are inline.
> >
> > On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <[email protected]> wrote:
> >
> > > First off, this is a pretty constructed case leading to a specious general conclusion.
> > >
> > > If you only have three RSs/DNs and the default replication factor of 3, each machine will get every single write.
> > > That is the first issue. Using HBase makes little sense with such a small cluster.
> >
> > You are correct; nonetheless, the network, as I measured it, was far from its capacity and thus probably not the bottleneck.
> >
> > > Secondly, as you say yourself, there are only three regionservers writing to the replicated cluster, using a single thread each in order to preserve ordering.
> > > With more region servers your scale will tip the other way. Again, more regionservers will make this better.
> >
> > I presume that, in production, I will add more region servers to accommodate the growing write demand on my cluster. Hence, my clients will write with more threads. Thus, proportionally, I will always have a lot more client threads than region servers (each of which has one replication thread). So I don't see how adding more region servers will tip the scale to the other side.
> > The only way to avoid this is to design the cluster so that if x client threads are enough to handle the events being written to HBase, then x is the number of region servers I should have. If I have a spike, it will even out eventually, but this under-utilizes my cluster hardware, no?
> >
> > > As for your other question, more threads can lead to better interleaving of CPU and IO, thus leading to better throughput (this relationship is not linear, though).
> > >
> > > -- Lars
> > >
> > > ----- Original Message -----
> > > From: Asaf Mesika <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Sent: Thursday, June 20, 2013 3:46 AM
> > > Subject: Replication not suited for intensive write applications?
> > >
> > > Hi,
> > >
> > > I've been conducting lots of benchmarks to test the maximum throughput of replication in HBase.
> > >
> > > I've come to the conclusion that HBase replication is not suited for write-intensive applications. I hope that people here can show me where I'm wrong.
> > >
> > > *My setup*
> > > *Cluster* (master and slave are alike):
> > > 1 Master, NameNode
> > > 3 RS, DataNode
> > >
> > > All computers are the same: 8 cores x 3.4 GHz, 8 GB RAM, 1 Gigabit ethernet card.
> > >
> > > I insert data into HBase from a Java process (client) reading files from disk, running on the machine running the HBase Master in the master cluster.
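(Illustration, not from the thread: a hypothetical sketch of this kind of load client against the 0.94 API, with N writer threads batching Puts; the table name, column family, and batch size are placeholders, and the 1500-byte value mirrors the average row size mentioned above.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadClient {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        final int threads = 10; // the thread count varied in the benchmark
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
          final int id = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                HTable table = new HTable(conf, "table1"); // placeholder name
                List<Put> batch = new ArrayList<Put>();
                for (int i = 0; i < 100000; i++) {
                  Put p = new Put(Bytes.toBytes("row-" + id + "-" + i));
                  p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), new byte[1500]);
                  batch.add(p);
                  if (batch.size() == 1000) { // flush in fixed-size batches
                    table.put(batch);
                    batch.clear();
                  }
                }
                table.put(batch); // flush the remainder
                table.close();
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
      }
    }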
> > > *Benchmark Results*
> > > When the client writes with 10 threads, the master cluster writes at 17 MB/sec, while the replicated cluster writes at 12 MB/sec. The data size I wrote is 15 GB, all Puts, to two different tables.
> > > Both clusters, when tested independently without replication, achieved a write throughput of 17-19 MB/sec, so evidently the replication process is the bottleneck.
> > >
> > > I also tested connectivity between the two clusters using "netcat" and achieved 111 MB/sec.
> > > I've checked the usage of the network cards on the client, the master cluster region servers and the slave region servers. No computer went over 30 MB/sec in Receive or Transmit.
> > > The way I checked was rather crude but works: I ran "netstat -ie" before HBase in the master cluster started writing and after it finished. The same was done on the replicated cluster (when the replication started and finished). I can tell the amount of bytes Received and Transmitted, and I know the duration each cluster worked, so I can calculate the throughput.
> > >
> > > *The bottleneck in my opinion*
> > > Since we've excluded network capacity, and each cluster works at a faster rate independently, all that is left is the replication process.
> > > My client writes to the master cluster with 10 threads, and manages to write at 17-18 MB/sec.
> > > Each region server has only 1 thread responsible for transmitting the data written to the WAL to the slave cluster. Thus in my setup I effectively have 3 threads writing to the slave cluster. This is the bottleneck, since the process cannot be parallelized: it must transmit the WAL in a certain order.
> > >
> > > *Conclusion*
> > > When writing intensively to HBase with more than 3 threads (in my setup), you can't use replication.
> > >
> > > *Master throughput without replication*
> > > On a different note, there is one thing I couldn't understand at all.
> > > When I turned off replication and wrote with my client with 3 threads, I got a throughput of 11.3 MB/sec. When I wrote with 10 threads (any more than that doesn't help), I achieved a maximum throughput of 19 MB/sec.
> > > The network cards showed 30 MB/sec Receive and 20 MB/sec Transmit on each RS, so the network capacity was not the bottleneck.
> > > On the HBase master machine, which ran the client, the network card showed a Receive throughput of 0.5 MB/sec and a Transmit throughput of 18.28 MB/sec. Hence it's the client machine's network card creating the bottleneck.
> > >
> > > The only explanation I have is the synchronized writes to the WAL. Those 10 threads have to get in line and, one by one, write their batch of Puts to the WAL, which creates a bottleneck.
> > >
> > > *My question*
> > > The one thing I couldn't understand is: when I write with 3 threads, I have no more than 3 concurrent RPC requests to write in each RS. They achieved 11.3 MB/sec.
> > > The write to the WAL is synchronized, so why did increasing the number of threads to 10 (x3 more) actually increase the throughput to 19 MB/sec? They all get in line to write to the same location, so it seems concurrent writes shouldn't improve throughput at all.
> > >
> > > Thank you!
> > >
> > > Asaf
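(Illustration, not from the thread: the netstat-based estimate described above reduces to arithmetic like the following; the counter values and duration are read off manually, and all names are placeholders.)

    public class NetThroughput {
      public static void main(String[] args) {
        // RX byte counters copied from "netstat -ie" before and after the run.
        long rxBefore = Long.parseLong(args[0]);
        long rxAfter  = Long.parseLong(args[1]);
        long seconds  = Long.parseLong(args[2]); // wall-clock duration of the run

        double mbPerSec = (rxAfter - rxBefore) / (1024.0 * 1024.0) / seconds;
        System.out.printf("Receive throughput: %.1f MB/sec%n", mbPerSec);
      }
    }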
