On Mon, Mar 12, 2018 at 7:08 PM, 张晓宁 <zhangxiaon...@jd.com> wrote:

> To your and Brock’s questions, my answers are as below.
>
> What client are you using to benchmark? You might also be bound by the client 
> performance.
>
> My Answer: we are using many load-testing machines to test the highest
> TPS on Kudu. Our testing clients should have enough capacity.
>

But, specifically, what client? Is it something you built directly using
the Java client? The C++ client? How many threads are you using? Which
flush mode are you using to write? What buffer sizes are you using?
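
For reference, here's roughly what I'd expect a Java-client writer to look
like; the master address, table name, "id" column, and the buffer/interval
values below are just placeholders, not a recommendation:

import org.apache.kudu.client.*;

// Sketch of a single-threaded Java-client writer. The master address, table
// name, "id" column, and the buffer/flush settings are placeholders.
public class WriteBenchSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      KuduTable table = client.openTable("bench_table");
      KuduSession session = client.newSession();

      // These settings have a large effect on per-client write throughput.
      session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
      session.setMutationBufferSpace(10000); // max buffered operations
      session.setFlushInterval(1000);        // background flush interval, ms

      for (long i = 0; i < 1_000_000; i++) {
        Insert insert = table.newInsert();
        insert.getRow().addLong("id", i);
        session.apply(insert);
      }
      session.flush();
      session.close();
    } finally {
      client.close();
    }
  }
}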


> I'd verify that the new nodes are assigned tablets, along with considering
> an increase in the number of partitions on the table being tested.
>
> My Answer: Yes, each time machines were added I created a new table for
> testing so that tablets could be assigned to the new machines. For the
> partition strategy, I am using 2-level partitioning: the first level is a
> range partition by date (I use 3 partitions here, meaning 3 days of data),
> and the second level is a hash partition (I use 3, 6, and 9 buckets
> respectively for the clusters with 3, 6, and 9 tservers).
>
>
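
(As an aside, a two-level scheme like that would look roughly like the
following in the Java client. The schema below -- an INT32 "day" column
holding yyyymmdd values plus an INT64 "id" -- is made up, since I don't know
your actual columns.)

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.*;

// Sketch only: range-partition by day (3 days), hash-partition by id
// (9 buckets for the 9-tserver cluster). Names, types, and values are examples.
public class CreatePartitionedTableSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("day", Type.INT32).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("val", Type.STRING).build()));

      CreateTableOptions opts = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("id"), 9)          // second level
          .setRangePartitionColumns(Arrays.asList("day"));    // first level

      // One range partition per day: [20180312, 20180313), [20180313, 20180314), ...
      for (int day = 20180312; day <= 20180314; day++) {
        PartialRow lower = schema.newPartialRow();
        lower.addInt("day", day);
        PartialRow upper = schema.newPartialRow();
        upper.addInt("day", day + 1);
        opts.addRangePartition(lower, upper);
      }

      client.createTable("bench_table_9ts", schema, opts);
    } finally {
      client.close();
    }
  }
}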
Did you delete the original table and wait some time before creating the
new table? Otherwise, you will see a skewed distribution where the new
table will have most of its replicas placed on the new empty machines. For
example:

1) with 6 servers, create a table with 18 partitions
-- it will evenly spread replicas across those 6 nodes (probably 9 each,
assuming the default replication factor of 3: 18 tablets * 3 replicas / 6
nodes = 9)
2) add 3 empty servers, create a new table with 27 partitions
-- the new table will probably have about 18 partitions on the new nodes
and 3 on the existing nodes (6:1 skew)
3) add 3 more empty servers and create another table
-- the new table will likely have most of its partitions on those 3 empty
nodes again

Of course with skew like that, you'll probably see that those new tables do
not perform well since most of the work would be on a smaller subset of
nodes.

If you delete the tables in between the steps you should see a more even
distribution.
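
One quick way to check for that kind of skew from the Java client is to count
replicas per tablet server, something like the sketch below (the master
address, table name, and timeout are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.kudu.client.*;

// Sketch: count how many tablet replicas of a table land on each tablet
// server, to spot the placement skew described above.
public class ReplicaDistributionSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master1:7051").build();
    try {
      KuduTable table = client.openTable("bench_table_9ts");
      Map<String, Integer> replicasPerServer = new HashMap<>();
      for (LocatedTablet tablet : table.getTabletsLocations(30000)) {
        for (LocatedTablet.Replica replica : tablet.getReplicas()) {
          String server = replica.getRpcHost() + ":" + replica.getRpcPort();
          replicasPerServer.merge(server, 1, Integer::sum);
        }
      }
      replicasPerServer.forEach((server, count) ->
          System.out.println(server + " hosts " + count + " replicas"));
    } finally {
      client.close();
    }
  }
}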


Another possibility that you may be hitting is that our buffering in the
clients is currently cluster-wide. In other words, each time you apply an
operation, it checks if the total buffer limit has been reached, and if it
has, it flushes the pending writes to all tablets. Only once all of those
writes are complete is the batch considered "completed", freeing up space
for the next batch of writes to be buffered. This means that, as the number
of tablets and tablet servers grows, the completion time for the batch is
increasingly dominated by the high-percentile latencies of the writes
rather than the average, causing per-client throughput to drop.
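
To make that concrete, here's a toy simulation (not Kudu code; the exponential
per-tablet flush latency with a 10 ms mean is an arbitrary assumption) showing
how the average batch completion time -- the slowest of the per-tablet flushes
-- grows with the number of tablets a single client fans out to:

import java.util.Random;

// Toy model: a batch completes only when the slowest of T per-tablet flushes
// finishes, so batch latency is the max of T latency samples and climbs toward
// the tail as T grows. The latency distribution here is invented.
public class BatchLatencySketch {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    int trials = 10000;
    for (int tablets : new int[] {9, 18, 27, 54}) {
      double total = 0;
      for (int i = 0; i < trials; i++) {
        double worst = 0;
        for (int t = 0; t < tablets; t++) {
          // Hypothetical per-tablet flush latency: exponential, ~10 ms mean.
          double latencyMs = -10.0 * Math.log(1.0 - rnd.nextDouble());
          worst = Math.max(worst, latencyMs);
        }
        total += worst;
      }
      System.out.printf("tablets=%d  avg batch completion ~%.1f ms%n",
          tablets, total / trials);
    }
  }
}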

This is tracked by KUDU-1693. I believe there was another related JIRA
somewhere as well, but I can't seem to find it. Unfortunately, fixing it is
not straightforward, though it would have a good impact for these cases
where a single writer is fanning out to tens or hundreds of tablets.

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera
