I forgot to mention that we only have one Kudu master on our dev cluster
and 6 tablet servers. I guess one of the tests you questioned confirms this
theory, because in that test we only raised the RPC thread count on the
master.

Thanks, Grant. We did look at the PutKudu processor, but we needed to
implement some complex custom logic and special error handling, so we ended
up writing a custom Groovy processor. It has worked really well (now that
we figured this issue out). PutKudu is also quite limited as of now - for
example, one has to specify the target table upfront, while we need to
handle 250 tables on the fly. We did see that it reuses the client, but we
could not figure out how to do the same in a custom Groovy processor.

NiFi does support native Java processors, which would have made things like
client reuse easy, but we did not go that route since it was really
cumbersome to debug/deploy and we were not willing to commit the time to do
it that way.
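
For context, the essence of the fix was keeping a single JVM-wide client
rather than one per flowfile - roughly something like this in Java (the
holder class and method names are ours and purely illustrative, not a NiFi
or Kudu API; the Kudu client itself is documented to be thread-safe):

    import org.apache.kudu.client.KuduClient;

    // Shared holder so every processor invocation reuses one KuduClient.
    public final class KuduClientHolder {
        private static volatile KuduClient client;

        private KuduClientHolder() {}

        public static KuduClient get(String masterAddresses) {
            if (client == null) {
                synchronized (KuduClientHolder.class) {
                    if (client == null) {
                        // Built once per JVM; the client caches tablet
                        // locations, so the master stops being a bottleneck
                        // after warm-up.
                        client = new KuduClient.KuduClientBuilder(masterAddresses).build();
                    }
                }
            }
            return client;
        }
    }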

Boris


On Thu, Feb 14, 2019 at 5:04 PM Jean-Daniel Cryans <jdcry...@apache.org>
wrote:

> One order of magnitude improvement is great news!
>
> There's something to learn here... Maybe the masters should be screaming
> if they're being slammed with tablet location requests (which I'm guessing
> was your case). Or should we have some recipes like "here's how you should
> write to Kudu from NiFi"? Any thoughts?
>
> In any case, thanks for reporting back!
>
> J-D
>
> On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi J-D, we just solved our performance issue! Client reuse did the trick.
>> We knew it was something we had to do, but, without going into details,
>> it was not an easy thing to do, because we use NiFi to coordinate
>> everything and could not find a way to reuse the Kudu client safely. Data
>> comes in mini batches (200 operations on average), and for every mini
>> batch we would open a new client and a new session, then close them. We
>> also run the sessions in the fastest mode (AUTO_FLUSH_BACKGROUND), but if
>> we hit any errors, we do another pass in AUTO_FLUSH_SYNC mode because of
>> KUDU-2625, as we need to implement some custom error handling logic (some
>> errors in our case can be safely ignored while others cannot).
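>>
>> A rough sketch of that two-pass pattern with the Kudu Java client (here
>> buildOperations() and isIgnorable() are hypothetical stand-ins for our
>> batch assembly and custom error classification logic):
>>
>>     import org.apache.kudu.client.*;
>>
>>     void writeBatch(KuduClient client) throws KuduException {
>>       // Fast pass: let the client batch and flush in the background.
>>       KuduSession session = client.newSession();
>>       session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
>>       for (Operation op : buildOperations()) {
>>         session.apply(op);
>>       }
>>       session.close(); // flushes anything still buffered
>>       if (session.countPendingErrors() == 0) {
>>         return;
>>       }
>>       // Error pass: AUTO_FLUSH_SYNC makes each apply() return its own row
>>       // error (per-row attribution is what KUDU-2625 complicates in
>>       // background mode).
>>       KuduSession retry = client.newSession();
>>       retry.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC);
>>       for (Operation op : buildOperations()) { // fresh Operation objects
>>         OperationResponse resp = retry.apply(op);
>>         if (resp.hasRowError() && !isIgnorable(resp.getRowError())) {
>>           throw new RuntimeException("row error: " + resp.getRowError());
>>         }
>>       }
>>       retry.close();
>>     }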
>>
>> Long story short, we found a way to reuse the client instance in NiFi
>> while still keeping NiFi's native concurrency benefits, and our
>> performance improved by at least 10 times!
>>
>> Thanks for your help and ideas! It is quite a relief for us!
>> Boris
>>
>> On Thu, Feb 14, 2019 at 11:51 AM Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>
>>> Hi Boris,
>>>
>>> Thank you for all those details. Some questions:
>>>
>>> - Is test 3 right for the tserver threads/queue? Or did you really just
>>> bump the master's threads/queue and it magically made it ~2.3 times faster?
>>> And then in test 5 you also bumped the tservers and it barely made it better?
>>> Unless I'm missing something big, this is very unexpected. Are you
>>> re-creating the Kudu Client each time you send operations? Otherwise, the
>>> tablet locations should stay in cache and the masters would be completely
>>> out of the picture after the first few seconds.
>>>
>>> - Are you batching the writes? What kind of flush mode are you using?
>>>
>>> - Are there some server-side graphs you can share? I'd like to see
>>> inserts/second across the servers for the duration of the tests.
>>>
>>> - Can you share your table schema and partitions schema? For the columns
>>> I'm mostly interested in the row keys and the cardinality of each column.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi J-D and thanks for your comments.
>>>>
>>>> On a very high level, we subscribe to 750 Kafka topics and wrote a
>>>> custom app using the Java API to insert, update or delete into 250 Kudu
>>>> tables (messages from 3 topics are merged into a single Kudu table). Our
>>>> custom code can spawn any number of threads, and we experimented with
>>>> 10, 20 and 50.
>>>>
>>>> When we did our first test on the development 6-node cluster
>>>> (high-density 2x22 core beast) with 5 tables/15 topics and 10 concurrent
>>>> threads, all was good and promising.
>>>> Once we added all the tables/topics, our process became very slow and
>>>> throughput dropped by 20-30 times. We increased the number of threads in
>>>> our custom code to 50, and that is when we noticed that Kudu was using
>>>> only 10 threads while the other threads waited in the queue.
>>>>
>>>> Our target Kudu tables were empty when we started, and the cluster was
>>>> pretty much idle.
>>>>
>>>> Here are the results of our benchmarks that you might find interesting.
>>>>
>>>> “threads/queue” in the header means rpc_num_service_threads,
>>>> rpc_service_queue_length.
>>>>
>>>> We started with the defaults, and tests 5 and 6 turned out to be the
>>>> best - quite a dramatic difference from test 1.
>>>>
>>>> I should also mention that we are running this on our DEV 6-node
>>>> cluster, but it is pretty beefy (2x24 CPU cores, 256GB of RAM, 12 disks,
>>>> etc.), and the cluster was not doing anything else but writing into Kudu.
>>>>
>>>> It is also interesting that test 7 did not give any further
>>>> improvements - our speculation here is that we just hit the limits of
>>>> our 6-node Kudu cluster, since it can only handle so many tablets at
>>>> once and we use a replication factor of 3.
>>>>
>>>> In another test, we created a simple app that runs selects on these
>>>> tables while the first app keeps writing into them. Throughput dropped
>>>> quite a bit as well with the defaults, but bumping the RPC threads helped.
>>>>
>>>> If you have any other thoughts/observations, I would like to hear them!
>>>>
>>>> I think things like this should be somewhere in the Kudu docs, along
>>>> with the few important parameters that orgs new to Kudu must tweak. I
>>>> can write a blog post about it, but I am no Kudu dev, so I do not want
>>>> to misrepresent anything.
>>>>
>>>> For example, we learned the hard way to tweak these two parameters
>>>> right away, as insert performance was terrible out of the box:
>>>>
>>>> [screenshot of advice from Will Berkeley: you shouldn't need to tune
>>>> too many parameters despite the large number of available ones - tune
>>>> --memory_limit_hard_bytes to control the total amount of memory Kudu
>>>> will use, and tune --maintenance_manager_num_threads to about 1/3 of the
>>>> number of disks you are using for Kudu (assuming you are using the
>>>> latest version, 1.5)]
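>>>>
>>>> On our nodes, that advice would translate to something like the flags
>>>> below (illustrative values only: 12 disks / 3 = 4 maintenance threads,
>>>> and the memory limit is just an arbitrary 32GB expressed in bytes):
>>>>
>>>>     --maintenance_manager_num_threads=4
>>>>     --memory_limit_hard_bytes=34359738368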
>>>>
>>>>
>>>>          app      master          tserver         total        total       avg ops   avg ms per   avg ops per
>>>>          threads  (threads,queue) (threads,queue) operations   duration    per sec   flowfile     flowfile
>>>> test 1   10       10,50           10,50            5,906,849   60 minutes   98,447    404.389279   66.8861423
>>>> test 2   30       10,50           10,50            3,107,938   32 minutes   97,123   1611.71499   102.416727
>>>> test 3   10       30,100          10,50           13,617,274   60 minutes  226,954    448.954472  196.878148
>>>> test 4   30       30,100          10,50            5,794,268   60 minutes   96,571   2342.55094   114.822107
>>>> test 5   10       30,100          30,100          16,813,710   60 minutes  280,228    391.113644  183.887024
>>>> test 6   30       30,100          30,100          15,903,303   60 minutes  265,055   2300.38629   341.316543
>>>> test 7   30       50,200          50,200          12,549,114   60 minutes  209,151   2364.45707   276.851262
>>>>
>>>> On Wed, Feb 13, 2019 at 11:39 AM Jean-Daniel Cryans <
>>>> jdcry...@apache.org> wrote:
>>>>
>>>>> Some comments on the original problem: "we need to process 1000s of
>>>>> operations per second and noticed that our Kudu 1.5 cluster was only using
>>>>> 10 threads while our application spins up 50 clients/threads"
>>>>>
>>>>> I wouldn't directly infer that 20 threads won't be enough to match
>>>>> your needs. The time it takes to service a request can vary greatly:
>>>>> in a given second, a single thread could process 500 operations that
>>>>> take 2ms each, or 2 that take 500ms each, and you have 20 of those
>>>>> threads. The queue is there to make sure that the threads are kept busy
>>>>> instead of bouncing the clients back the moment all the threads are
>>>>> occupied. Your 50 threads can't constantly pound all the tservers;
>>>>> there's time spent on the network and whatever processing needs to
>>>>> happen client-side before they go back to Kudu.
>>>>>
>>>>> TBH there's not a whole lot of science around how we set those two
>>>>> defaults (# of threads and queue size); it's very workload-dependent.
>>>>> Ideally the tservers would just right-size the pools based on the kind
>>>>> of requests that are coming in and the amount of memory they can use. I
>>>>> guess CPU also comes into the picture, but again it depends on the
>>>>> workload; Kudu stores data, so it tends to be IO-bound more than
>>>>> CPU-bound.
>>>>>
>>>>> But the memory concern is very real. To be put in the queue, the
>>>>> requests must first be read from the network, so it doesn't take that
>>>>> many 2MB batches of inserts to occupy a lot of memory. Scans, on the
>>>>> other hand, become a memory concern in the threads, because that's
>>>>> where they materialize data in memory and, depending on the number of
>>>>> columns scanned and the kind of data that's read, it could be a lot.
>>>>> That's why the defaults aren't arbitrarily high; they're more on the
>>>>> safe side.
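>>>>>
>>>>> (Back-of-the-envelope: with the default queue length of 50, a queue
>>>>> full of 2MB insert batches is already about 50 x 2MB = 100MB of
>>>>> buffered requests on a single tserver, before any thread has even
>>>>> touched them.)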
>>>>>
>>>>> Have you actually encountered performance issues that you could trace
>>>>> back to this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 13, 2019 at 3:49 AM Boris <boris...@gmail.com> wrote:
>>>>>
>>>>>> But if we bump the thread count to 50, and the queue default is 50,
>>>>>> we should probably bump the queue to 100 or something like that, right?
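>>>>>>
>>>>>> E.g., something like this in each tserver's gflagfile (flag names are
>>>>>> from the docs; the values are just a guess on our side):
>>>>>>
>>>>>>     --rpc_num_service_threads=50
>>>>>>     --rpc_service_queue_length=100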
>>>>>>
>>>>>> On Wed, Feb 13, 2019, 00:54 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>
>>>>>>> I don't see other flags that are relevant here; maybe others can
>>>>>>> chime in.
>>>>>>>
>>>>>>> For --rpc_service_queue_length, it configures the size of the RPC
>>>>>>> request queues. The queue helps to buffer requests in case a bunch of
>>>>>>> them come in at once while the service threads are still busy
>>>>>>> processing requests that arrived earlier. But I don't see how it can
>>>>>>> help with handling more concurrent requests.
>>>>>>>
>>>>>>> Best,
>>>>>>> Hao
>>>>>>>
>>>>>>> On Tue, Feb 12, 2019 at 6:45 PM Boris <boris...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Hao, appreciate your response.
>>>>>>>>
>>>>>>>> Do we also need to bump other RPC thread-related parameters, like
>>>>>>>> the queue, etc.?
>>>>>>>>
>>>>>>>> On Tue, Feb 12, 2019, 21:09 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Boris,
>>>>>>>>>
>>>>>>>>> Sorry for the delay. --rpc_num_service_threads sets the number of
>>>>>>>>> threads in the RPC service thread pool (the default is 20 for
>>>>>>>>> tablet servers, 10 for masters). It should help with processing
>>>>>>>>> concurrent incoming RPC requests, but increasing it beyond the
>>>>>>>>> number of available CPU cores on the machines may not bring much
>>>>>>>>> value.
>>>>>>>>>
>>>>>>>>> You don't need to set the same value for masters and tablet
>>>>>>>>> servers. Most of the time, tablet servers will see more RPCs, since
>>>>>>>>> that is where the scans and writes take place.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>> On Tue, Feb 12, 2019 at 5:29 PM Boris Tyukin <
>>>>>>>>> bo...@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> Can someone point us to documentation, or explain what these
>>>>>>>>>> parameters really mean and how they should be set on a production
>>>>>>>>>> cluster? I will greatly appreciate it!
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin <
>>>>>>>>>> bo...@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> we need to process 1000s of operations per second and noticed
>>>>>>>>>>> that our Kudu 1.5 cluster was only using 10 threads while our
>>>>>>>>>>> application spins up 50 clients/threads. We observed in the web
>>>>>>>>>>> UI that only 10 threads are working while the other 40 wait in
>>>>>>>>>>> the queue.
>>>>>>>>>>>
>>>>>>>>>>> We found the rpc_num_service_threads parameter in the
>>>>>>>>>>> configuration guide, but it is still not clear to me exactly what
>>>>>>>>>>> we need to adjust to allow Kudu to handle more concurrent
>>>>>>>>>>> operations.
>>>>>>>>>>>
>>>>>>>>>>> Do we just bump the parameter below, or do we need to consider
>>>>>>>>>>> other RPC-related parameters?
>>>>>>>>>>>
>>>>>>>>>>> Also, do we need to use the same numbers for masters and tablet
>>>>>>>>>>> servers?
>>>>>>>>>>>
>>>>>>>>>>> Are there any good numbers to target based on CPU core count?
>>>>>>>>>>>
>>>>>>>>>>> --rpc_num_service_threads
>>>>>>>>>>> <https://kudu.apache.org/docs/configuration_reference.html#kudu-master_rpc_num_service_threads>
>>>>>>>>>>>
>>>>>>>>>>> Number of RPC worker threads to run
>>>>>>>>>>> Type: int32
>>>>>>>>>>> Default: 10
>>>>>>>>>>> Tags: advanced
>>>>>>>>>>>
>>>>>>>>>>
