Hi Todd, thank you for the analysis!

Please see my comments inline, marked "XiaoNing:".

From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: March 13, 2018, 23:43
To: user@kudu.apache.org
Subject: Re: Follow-up for "Kudu cluster performance cannot grow up with machines"

On Mon, Mar 12, 2018 at 7:08 PM, 张晓宁 <zhangxiaon...@jd.com> wrote:
To your and Brock’s questions, my answers are as below.

What client are you using to benchmark? You might also be bound by the client.
My Answer: we are using many stress-testing machines to find the highest TPS on
Kudu. Our testing clients should have enough capacity.

But, specifically, what client? Is it something you build directly using the 
Java client? The C++ client? How many threads are you using? Which flush mode 
are you using to write? What buffer sizes are you using?
XiaoNing: We are using the Java client. The test increases the thread count as it
runs; the peak is generally reached at around 500 threads. We are using manual
flushing. We started with the default automatic flush mode but did not get good
performance with that mode. For the buffer size, we are using 10K. Since our batch
size is 100 rows of about 200 bytes each, each flush is 100 * 200 = 20K bytes. Do
you have any good advice on this setting?
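
For reference, our setup is roughly like the following sketch with the Java client
(the master address, table name, and "id" column are simplified placeholders, not
the actual schema):

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;

public class ManualFlushSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-host:7051").build(); // placeholder
    KuduTable table = client.openTable("bench_table"); // placeholder table name
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
    // Note: in the Java client this limit is a count of buffered
    // operations, not bytes.
    session.setMutationBufferSpace(1000);

    for (long i = 0; i < 100; i++) { // one batch of 100 rows
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addLong("id", i); // assumes an INT64 key column named "id"
      session.apply(insert);
    }
    session.flush(); // send the whole batch in one round of RPCs
    session.close();
    client.close();
  }
}

One thing worth double-checking: since setMutationBufferSpace() counts operations
rather than bytes in the Java client, a "10K" setting may mean 10,000 buffered
operations instead of 10KB.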
I'd verify that the new nodes are assigned tablets, and also consider increasing
the number of partitions on the table being tested.
My Answer: Yes, each time machines were added, I created a new table for testing
so that tablets could be assigned to the new machines. For the partition strategy,
I am using 2-level partitioning: the first level is a range partition by date (3
partitions, i.e. 3 days of data), and the second level is a hash partition (3, 6,
and 9 buckets respectively for the clusters with 3, 6, and 9 tservers).
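
For concreteness, the layout is roughly like the following sketch (schema, table
name, dates, and master address are simplified placeholders; shown for the
9-tserver case with 9 hash buckets):

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.PartialRow;

public class CreateBenchTable {
  public static void main(String[] args) throws KuduException {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-host:7051").build(); // placeholder
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("event_date", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()));

    CreateTableOptions opts = new CreateTableOptions()
        // first level: range partition by date
        .setRangePartitionColumns(Arrays.asList("event_date"))
        // second level: hash partition; 9 buckets for the 9-tserver cluster
        .addHashPartitions(Arrays.asList("id"), 9);

    // one range partition per day (lower bound inclusive, upper exclusive)
    String[] days = {"2018-03-10", "2018-03-11", "2018-03-12", "2018-03-13"};
    for (int i = 0; i < days.length - 1; i++) {
      PartialRow lower = schema.newPartialRow();
      lower.addString("event_date", days[i]);
      PartialRow upper = schema.newPartialRow();
      upper.addString("event_date", days[i + 1]);
      opts.addRangePartition(lower, upper);
    }
    client.createTable("bench_table", schema, opts); // 3 ranges x 9 buckets = 27 tablets
    client.close();
  }
}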

Did you delete the original table and wait some time before creating the new 
table? Otherwise, you will see a skewed distribution where the new table will 
have most of its replicas placed on the new empty machines. For example:

1) with 6 servers, create table with 18 partitions
-- it will evenly spread replicas on those 6 nodes (probably 9 each)
2) add 3 empty servers, create a new table with 27 partitions
-- the new table will probably have about 18 partitions on the new nodes and 3 
on the existing nodes (6:1 skew)
3) same again
-- the new table will likely have most of its partitions on those 3 empty nodes.

Of course with skew like that, you'll probably see that those new tables do not 
perform well since most of the work would be on a smaller subset of nodes.

If you delete the tables in between the steps, you should see a more even
distribution.
XiaoNing: Yes, I always delete the old table before creating the new one. But it
seems the old data is not removed when the table is deleted; is that true? At the
very beginning, we were testing with 1 master and 9 tservers and got the same
result, so I do not think partitioning is the problem here. Anyway, I can run some
more tests on it.
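
For those tests, one way to check how a table's replicas are spread across tablet
servers is a rough sketch like this with the Java client (master address and table
name are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.LocatedTablet;

public class ReplicaDistribution {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-host:7051").build(); // placeholder
    KuduTable table = client.openTable("bench_table");                // placeholder
    Map<String, Integer> replicasPerServer = new HashMap<>();
    // Count how many tablet replicas land on each tablet server.
    for (LocatedTablet tablet : table.getTabletsLocations(10000)) {
      for (LocatedTablet.Replica r : tablet.getReplicas()) {
        String server = r.getRpcHost() + ":" + r.getRpcPort();
        replicasPerServer.merge(server, 1, Integer::sum);
      }
    }
    replicasPerServer.forEach((server, n) ->
        System.out.println(server + " hosts " + n + " replicas"));
    client.close();
  }
}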

Another possibility that you may be hitting is that our buffering in the 
clients is currently cluster-wide. In other words, each time you apply an 
operation, it checks if the total buffer limit has been reached, and if it has, 
it flushes the pending writes to all tablets. Only once all of those writes are 
complete is the batch considered "completed", freeing up space for the next 
batch of writes to be buffered. This means that, as the number of tablets and 
tablet servers grow, the completion time for the batch is increasingly 
dominated by the high-percentile latencies of the writes rather than the 
average, causing per-client throughput to drop.
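
As a rough illustration (a toy simulation, not Kudu code; the latency distribution
is made up): a synchronous batch fanning out to N tablets completes only when the
slowest of the N per-tablet writes finishes, so average batch time climbs with
fan-out even though each tablet's typical latency is unchanged.

import java.util.Random;

public class TailLatencyToy {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    int trials = 10_000;
    for (int tablets : new int[] {1, 10, 100}) {
      double total = 0;
      for (int t = 0; t < trials; t++) {
        double batchTime = 0;
        for (int i = 0; i < tablets; i++) {
          // made-up log-normal latency: ~10ms typical with a heavy tail
          double latency = Math.exp(Math.log(10) + 0.8 * rnd.nextGaussian());
          batchTime = Math.max(batchTime, latency); // batch waits for the slowest
        }
        total += batchTime;
      }
      System.out.printf("%3d tablets -> avg batch completion %.1f ms%n",
          tablets, total / trials);
    }
  }
}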
XiaoNing: As mentioned above, we are using 10K as the client buffer size, and each
of our batches is 20K bytes of data. Do you think this will impact performance? As
tablet servers are added to the cluster, the flush time will increase as well,
right? In your benchmark testing, how many hosts do you use per cluster?

This is tracked by KUDU-1693. I believe there was another related JIRA somewhere as
well, but I can't seem to find it. Unfortunately, fixing it is not straightforward,
though it would have a good impact for these cases where a single writer is fanning
out to tens or hundreds of tablets.


Todd Lipcon
Software Engineer, Cloudera
