I was using flush() after sending a bunch of mutations to the batchwriters to 
limit their latency. I thought it would normally flush the buffer to ensure 
that the maxLatency is not violated. If the maxLatency is quite large, how do I 
ensure that it doesn't wait a long time before writing?

If the returned batch writers are all thread safe, then I'm still going to have 
the bottleneck of their synchronized addMutations method, correct?

I'm looking for "org.apache.accumulo.client.impl" in log4j.properties, 
generic_logger.xml, and the other config files, but can't locate it. Do I need 
to create a new entry for it there?

Thanks,
David

From: Keith Turner [mailto:[email protected]]
Sent: Thursday, September 19, 2013 7:01 PM
To: [email protected]
Subject: Re: BatchWriter performance on 1.4

On Thu, Sep 19, 2013 at 5:08 PM, Slater, David M. <[email protected]> wrote:
Thanks Keith, I'm looking at it now. It appears like what I would want. As for 
the proper usage...

Would I create one using the Connector,
then .getBatchWriter() for each of the tables I'm interested in,
add data to each of BatchWriters returned,

yes.

and then call flush() when I want all of that to get written?

Why are you calling flush()? Doing this frequently will increase RPC 
overhead and lower throughput.


Would the individual batch writers spawned by the multiTableBatchWriter still 
have synchronized addMutations() methods so I would have to worry about 
blocking still, or would that all happen at the flush() method?

The returned batch writers are thread safe. They all add to the same 
queue/buffer in a synchronized manner.   Calling flush() on any of the batch 
writers returned from getBatchWriter() will block the others.
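
To make the point above concrete, here is a toy model in plain Java (not Accumulo's code) of several per-table handles feeding one lock-protected buffer. The names SharedBufferModel, TableWriter, and writerFor are made up for illustration, and mutations are reduced to strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: one shared, lock-protected buffer behind several per-table handles. */
final class SharedBufferModel {
    private final Object lock = new Object();
    private final List<String> buffer = new ArrayList<>(); // pending "table:mutation" entries
    private final List<String> sent = new ArrayList<>();   // entries already written out

    /** Hypothetical stand-in for a per-table writer handle. */
    final class TableWriter {
        private final String table;

        TableWriter(String table) {
            this.table = table;
        }

        void addMutation(String mutation) {
            synchronized (lock) { // every handle contends for the same lock
                buffer.add(table + ":" + mutation);
            }
        }

        void flush() {
            synchronized (lock) { // while this runs, no handle can add
                sent.addAll(buffer);
                buffer.clear();
            }
        }
    }

    TableWriter writerFor(String table) {
        return new TableWriter(table);
    }

    int pending() {
        synchronized (lock) {
            return buffer.size();
        }
    }

    int sentCount() {
        synchronized (lock) {
            return sent.size();
        }
    }
}
```

In this model, a flush() through one handle drains entries added through every handle, and any thread calling addMutation() on another handle waits until the flush releases the lock.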

If you set the log4j log level to TRACE for 
org.apache.accumulo.client.impl, you can see output like the following.  Binning 
is the process of taking each mutation and deciding which tablet and tablet 
server it goes to.

  2013-09-19 18:43:37,261 [impl.ThriftTransportPool] TRACE: Using existing connection to 127.0.0.1:9997
  2013-09-19 18:43:37,393 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binning 80909 mutations for table 3
  2013-09-19 18:43:37,402 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binned 80909 mutations for table 3 to 1 tservers in 0.009 secs
  2013-09-19 18:43:37,402 [impl.TabletServerBatchWriter] TRACE: Started sending 80,909 mutations to 1 tablet servers
  2013-09-19 18:43:37,656 [impl.ThriftTransportPool] TRACE: Returned connection 127.0.0.1:9997 (120000) ioCount : 1459116
  2013-09-19 18:43:37,657 [impl.TabletServerBatchWriter] TRACE: sent 80,909 mutations to 127.0.0.1:9997 in 0.40 secs (204,832.91 mutations/sec) with 0 failures
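
The binning step reported in the trace above can be pictured with a small sketch. This is a simplified model, not Accumulo's locator code: it assumes each tablet is identified by an inclusive end row, looks up the first end row at or after a mutation's row, and groups rows by the server hosting that tablet. The class name BinningSketch, the server strings, and the split points in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy binning: map each row to the tablet whose end row is the smallest one >= the row. */
final class BinningSketch {
    // Key: tablet end row (inclusive). Value: hypothetical "host:port" of its tablet server.
    private final TreeMap<String, String> tabletsByEndRow = new TreeMap<>();
    // Server for the last tablet, which holds every row after the largest end row.
    private final String lastTabletServer;

    BinningSketch(Map<String, String> endRowToServer, String lastTabletServer) {
        tabletsByEndRow.putAll(endRowToServer);
        this.lastTabletServer = lastTabletServer;
    }

    /** Groups rows by the server that hosts the tablet containing each row. */
    Map<String, List<String>> bin(List<String> rows) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String row : rows) {
            Map.Entry<String, String> tablet = tabletsByEndRow.ceilingEntry(row);
            String server = (tablet != null) ? tablet.getValue() : lastTabletServer;
            bins.computeIfAbsent(server, s -> new ArrayList<>()).add(row);
        }
        return bins;
    }
}
```

With split points "g" and "p", rows "apple" and "g" land in the first tablet, "kiwi" in the second, and "zebra" in the last. Each lookup is a sorted-map search, which fits the trace showing binning as a small fraction of the send time.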

When you close the batch writer, it will log some summary stats like the 
following.


  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: TABLET SERVER BATCH WRITER STATISTICS
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Added                :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Sent                 :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Resent percentage    :       0.00%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall time         :       5.94 secs
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall send rate    : 168,406.87 mutations/sec
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Send efficiency      :      86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: BACKGROUND WRITER PROCESS STATISTICS
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Total send time      :       5.16 secs  86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Average send rate    : 193,760.90 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Total bin time       :       0.46 secs   7.81%
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Average bin rate     : 2,155,172.41 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tservers per batch   :     1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tablets per batch    :     1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: SYSTEM STATISTICS
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: JVM GC Time          :       0.53 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: JVM Compile Time     :       1.60 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: System load average  : initial=  0.22 final=  0.20
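
For anyone comparing their own output against the summary above, the derived figures appear to follow from simple ratios: a rate is mutations divided by seconds, and each percentage is a phase time divided by the overall time. The sketch below recomputes them from the printed values. The class and method names (SendStats, rate, percentOf) are illustrative and not part of Accumulo.

```java
import java.util.Locale;

/** Recomputes the derived figures in the summary from the rounded values it prints. */
final class SendStats {
    /** Mutations per second. */
    static double rate(long mutations, double seconds) {
        return mutations / seconds;
    }

    /** Share of the overall time taken by one phase, as a percentage. */
    static double percentOf(double phaseSeconds, double overallSeconds) {
        return 100.0 * phaseSeconds / overallSeconds;
    }

    public static void main(String[] args) {
        long sent = 1_000_000L;   // "Sent" line
        double overall = 5.94;    // "Overall time" line, seconds
        double send = 5.16;       // "Total send time" line, seconds
        double bin = 0.46;        // "Total bin time" line, seconds

        System.out.printf(Locale.ROOT, "Overall send rate : %,.2f mutations/sec%n", rate(sent, overall));
        System.out.printf(Locale.ROOT, "Average send rate : %,.2f mutations/sec%n", rate(sent, send));
        System.out.printf(Locale.ROOT, "Send share        : %.2f%%%n", percentOf(send, overall));
        System.out.printf(Locale.ROOT, "Bin share         : %.2f%%%n", percentOf(bin, overall));
    }
}
```

The recomputed values land close to the logged ones but not exactly on them (roughly 168,350 mutations/sec and 86.9% for the first and third lines), presumably because the log prints times rounded to two decimals while computing rates from unrounded values.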

What do these numbers look like for you?

Keith


From: Keith Turner [mailto:[email protected]]
Sent: Thursday, September 19, 2013 12:39 PM
To: [email protected]
Subject: Re: BatchWriter performance on 1.4

Are you aware of the multi table batch writer?  I am not sure if it would be 
useful, but wanted to make sure you knew about it.   It will use the same 
thread pool to process mutations for multiple tables.  Also it will batch 
mutations for multiple tablets into the same rpc calls.

On Wed, Sep 18, 2013 at 5:07 PM, Slater, David M. <[email protected]> wrote:
Hi, I'm running a single-threaded ingestion program that takes data from an 
input source, parses it into mutations, and then writes those mutations 
(sequentially) to four different BatchWriters (all on different tables). Most 
of the time (95%) is spent adding mutations, e.g. 
batchWriter.addMutations(mutations). I am wondering how to reduce the time 
taken by these calls.

1) For the method batchWriter.addMutations(Iterable<Mutation>), does it matter 
for performance whether the mutations returned by the iterator are sorted in 
lexicographic order?

2) If the Iterable<Mutation> that I pass to the BatchWriter is very large, will 
I need to wait for a number of Batches to be written and flushed before it will 
finish iterating, or does it transfer the elements of the Iterable to a 
different intermediate list?

3) If that is the case, would it then make sense to spawn off short threads for 
each time I make use of addMutations?

At a high level, my code looks like this:

BatchWriter bw1 = connector.createBatchWriter(...)
BatchWriter bw2 = ...
...
while (true) {
    String[] data = input.getData();
    List<Mutation> mutations1 = parseData1(data);
    List<Mutation> mutations2 = parseData2(data);
    ...
    bw1.addMutations(mutations1);
    bw2.addMutations(mutations2);
    ...
}
Thanks,
David

