As you said, I expect it depends on many variables. I ran a quick & dirty experiment when first evaluating Kudu 1.0 to see how flushing at varying intervals affected insert rates. I had one master and one tserver, each in the default configuration, on an ext4 filesystem on a spinning disk. The table had two string columns, "key" and "value", both part of the primary key, each less than 30 bytes. Here are the results:

Manual flush every insert: 100K inserts in 14.5s (~7K/s)
Manual flush every 100K: 1M inserts in 4.7s (~215K/s, w/ warnings about "blocked reactor thread")
Manual flush every 10K: 1M inserts in 4.2s (~240K/s)
Auto flush background, no explicit flush: 1M inserts in 4.8s (w/ warnings about "blocked reactor thread" and "thread stuck")
Auto flush background, explicit flush every 10K inserts: 1M inserts in 4.2s (~240K/s)
Async flush every 10K inserts: 1M inserts in 2.8s (~350K/s)
Async flush every 1K inserts: 1M inserts in 2.7s (~370K/s)
Async flush every 100: 1M inserts in 3.3s (~300K/s)
Async flush every 10: 1M inserts in 10.6s (~95K/s)

Based on this experiment, I chose async flush with a 1K interval, because beyond that there is diminishing return, and I don't want to run out of mutation space.

On Tue, Feb 28, 2017 at 6:29 AM, Nicolas Fouché <[email protected]> wrote:

> Hi. Is there any recommendation on the number of operations in
> bulk/AUTO_FLUSH_BACKGROUND ? I guess it highly depends on the cluster size,
> the number of partitions hit by the operations, etc. But there could be
> some guidelines out there ?
>
>
> Looking at the code of the kudu client, it seems that the default size is
> 1000: `private int mutationBufferSpace = 1000;`.
>
> - Nicolas
>
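
In case it helps, here is a rough sketch of the "flush every N inserts" loop from the experiment, with the interval kept at or below the buffer capacity so the buffer never overflows. It does not use the Kudu client: BufferedSession and CountingSession are hypothetical stand-ins invented for illustration, the "buffer full" error is an assumed behavior, and the flush is synchronous, so it says nothing about async throughput.

```java
import java.util.ArrayList;
import java.util.List;

public class FlushEveryN {
    /** Hypothetical stand-in for a client session; NOT the Kudu API. */
    interface BufferedSession {
        void apply(String key, String value);
        void flush();
    }

    /** In-memory session that counts flushes and enforces a buffer limit. */
    static class CountingSession implements BufferedSession {
        private final int bufferSpace;               // max buffered operations (assumed semantics)
        private final List<String[]> buffer = new ArrayList<>();
        int flushCount = 0;
        long rowsFlushed = 0;

        CountingSession(int bufferSpace) {
            this.bufferSpace = bufferSpace;
        }

        @Override
        public void apply(String key, String value) {
            // Assumption: applying to a full buffer is an error.
            if (buffer.size() >= bufferSpace) {
                throw new IllegalStateException("buffer full");
            }
            buffer.add(new String[] {key, value});
        }

        @Override
        public void flush() {
            if (buffer.isEmpty()) {
                return;                              // nothing to send
            }
            rowsFlushed += buffer.size();
            buffer.clear();
            flushCount++;
        }
    }

    /** Applies `total` inserts, flushing after every `flushInterval` of them. */
    static void insertAll(BufferedSession session, int total, int flushInterval) {
        for (int i = 0; i < total; i++) {
            session.apply("key" + i, "value" + i);
            if ((i + 1) % flushInterval == 0) {
                session.flush();
            }
        }
        session.flush();                             // flush any final partial batch
    }

    public static void main(String[] args) {
        CountingSession session = new CountingSession(1000);
        insertAll(session, 1_000_000, 1000);
        System.out.println(session.flushCount + " flushes, " + session.rowsFlushed + " rows");
    }
}
```

With a buffer of 1000 and a flush interval of 1000, the loop never hits the "buffer full" case; a larger interval than the buffer would, which is the situation I wanted to avoid.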
