On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <t...@cloudera.com> wrote:

> Sounds good.
>
> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
> For example, something like:
>
> kudu perf loadgen my-kudu-master.example.com --num-threads=8
> --num-rows-per-thread=1000000 --table-num-buckets=32
>
> There are also a bunch of options to tune buffer sizes, flush options,
> etc. But with the default settings above on an 8-node cluster I have, I was
> able to insert 8M rows in 44 seconds (180k/sec).
>
> Adding --buffer-size-bytes=10000000 almost doubled the above throughput
> (330k rows/sec)
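>
> That is, the full invocation would look something like:
>
> kudu perf loadgen my-kudu-master.example.com --num-threads=8 \
>     --num-rows-per-thread=1000000 --table-num-buckets=32 \
>     --buffer-size-bytes=10000000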
>

One more quick data point: I ran the above command four times simultaneously
(in parallel). Despite running 4x as many clients, they all finished in the
same time as a single client did (i.e. an aggregate throughput of ~1.2M
rows/sec).

Again this isn't a scientific benchmark, and it's such a short burst of
activity that it doesn't represent a real workload, but 15k rows/sec is
definitely at least an order of magnitude lower than the peak throughput I
would expect.

-Todd


>
> -Todd
>
>
>
>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>>
>>>
>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <sunc...@uber.com> wrote:
>>>
>>>> Thanks Zhen and Todd.
>>>>
>>>> Yes, increasing the # of consumers will definitely help, but we also
>>>> want to test the best throughput we can get from Kudu.
>>>>
>>>
>>> Sure, but increasing the number of consumers can increase the throughput
>>> (without increasing the number of Kudu tablet servers).
>>>
>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>
>>>
>>>>
>>>> I think the default batch size is 1000 rows?
>>>>
>>>
>>> In manual flush mode, it's up to you to determine how big your batches
>>> are. It will buffer until you call 'Flush()'. So you could wait until
>>> you've accumulated way more than 1000 to flush.
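>>>
>>> As a rough sketch (the table/column names, the Kafka iteration, and the
>>> batch size of 10,000 below are just placeholders for illustration), a
>>> manual-flush loop with the Java client looks roughly like:
>>>
>>>   import org.apache.kudu.client.*;
>>>
>>>   KuduClient client =
>>>       new KuduClient.KuduClientBuilder("my-kudu-master.example.com").build();
>>>   KuduTable table = client.openTable("my_table");
>>>   KuduSession session = client.newSession();
>>>   session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
>>>   session.setMutationBufferSpace(20000);     // headroom for buffered ops
>>>
>>>   int buffered = 0;
>>>   for (MyRecord rec : recordsFromKafka) {    // however you iterate your consumer
>>>     Insert insert = table.newInsert();
>>>     PartialRow row = insert.getRow();
>>>     row.addString("uuid", rec.getUuid());    // fill the columns
>>>     session.apply(insert);
>>>     if (++buffered >= 10000) {               // flush in large batches
>>>       session.flush();
>>>       buffered = 0;
>>>     }
>>>   }
>>>   session.flush();                           // don't forget the tail
>>>   session.close();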
>>>
>>>
>>>> I tested with a few different options between 1000 and 200000, but
>>>> always got something between 15K and 20K per sec. I also tried flush
>>>> background mode and 32 hash partitions, but the results were similar.
>>>>
>>>
>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>
>>>
>>>> The primary key is UUID + some string column though - they always come
>>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>>>
>>>
>>> Given this, are you hash-partitioning on just the UUID portion of the
>>> PK? I.e. if your PK is (uuid, timestamp), you could hash-partition on the
>>> UUID. This should ensure that you get pretty good batching of the writes.
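>>>
>>> For example (a sketch; the column names here are just placeholders), you
>>> can declare the full compound PK but hash-partition only on the uuid
>>> column:
>>>
>>>   import org.apache.kudu.*;
>>>   import org.apache.kudu.client.*;
>>>   import java.util.Arrays;
>>>
>>>   Schema schema = new Schema(Arrays.asList(
>>>       new ColumnSchema.ColumnSchemaBuilder("uuid", Type.STRING).key(true).build(),
>>>       new ColumnSchema.ColumnSchemaBuilder("event_key", Type.STRING).key(true).build(),
>>>       new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()));
>>>   CreateTableOptions opts = new CreateTableOptions()
>>>       .addHashPartitions(Arrays.asList("uuid"), 32);   // hash only on uuid
>>>   client.createTable("my_table", schema, opts);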
>>>
>>> Todd
>>>
>>>
>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> In addition to what Zhen suggests, I'm also curious how you are sizing
>>>>> your batches in manual-flush mode? With 128 hash partitions, each batch is
>>>>> generating 128 RPCs, so if for example you are only batching 1000 rows at
>>>>> a time, you'll end up with a lot of fixed overhead in each RPC to insert
>>>>> just 1000/128 = ~8 rows.
>>>>>
>>>>> Generally I would expect an 8-node cluster (even with HDDs) to be able
>>>>> to sustain an insert rate of several hundred thousand rows/second. Of course, it
>>>>> depends on the size of the rows and also the primary key you've chosen. If
>>>>> your primary key is generally increasing (such as the kafka sequence
>>>>> number) then you should have very little compaction and good performance.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zhqu...@gmail.com> wrote:
>>>>>
>>>>>> Maybe you can increase the number of consumers? In my opinion, more
>>>>>> insert threads can give better throughput.
>>>>>>
>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <sunc...@uber.com>:
>>>>>>
>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K
>>>>>>> / sec. :)
>>>>>>>
>>>>>>> Is there any other tuning I can do to further improve this? Also, how
>>>>>>> much would SSDs help in this case (only upserts)?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Chao
>>>>>>>
>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <t...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If you want to manage batching yourself, you can use the manual
>>>>>>>> flush mode. The easiest would be the auto-flush background mode.
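>>>>>>>>
>>>>>>>> A minimal sketch of the latter - just change the mode on the session
>>>>>>>> before applying inserts, and the client batches and flushes in the
>>>>>>>> background:
>>>>>>>>
>>>>>>>>   session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
>>>>>>>>   // ... apply() your inserts as before; call session.flush() (or close
>>>>>>>>   // the session) at the end to make sure everything buffered is written.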
>>>>>>>>
>>>>>>>> Todd
>>>>>>>>
>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Todd,
>>>>>>>>>
>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>>>> data.
>>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>>> follows the example here
>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>> .
>>>>>>>>> Specifically:
>>>>>>>>>
>>>>>>>>> while (consumer.hasNext()) {      // pseudo-code for the Kafka consumer loop
>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>   // fill the columns
>>>>>>>>>   kuduSession.apply(insert);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I didn't specify the flush mode, so it will pick up AUTO_FLUSH_SYNC
>>>>>>>>> as the default?
>>>>>>>>> Should I use MANUAL_FLUSH?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Chao
>>>>>>>>>
>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <t...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Chao,
>>>>>>>>>>
>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>
>>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>>>>>>> possible that it is Java code and you are using the SYNC flush mode?
>>>>>>>>>> That would result in a separate round trip for each record and thus
>>>>>>>>>> very low throughput.
>>>>>>>>>>
>>>>>>>>>> Todd
>>>>>>>>>>
>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>>>>>>> The data come from Kafka at a rate of around 30K rows/sec and are
>>>>>>>>>> hash-partitioned into 128 buckets. However, with default settings,
>>>>>>>>>> Kudu can only consume the topics at a rate of around 1.5K rows/sec.
>>>>>>>>>> This is a direct ingest with no transformation on the data.
>>>>>>>>>>
>>>>>>>>>> Could this be because I was using the default configurations? Also,
>>>>>>>>>> we are using Kudu on HDDs - could that be related?
>>>>>>>>>>
>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Chao
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
