Hi Shawn,

Answers inline below.
On Tue, Oct 17, 2017 at 12:59 PM, Shawn Terry <[email protected]> wrote:

> We ran into a problem today that looks like it might be related to this:
>
> https://issues.apache.org/jira/browse/KUDU-1891
>
> We had a client app crash with this same kind of error: “not enough
> mutation buffer space remaining for operation”. Currently the client app
> was queuing up a number of writes and doing manual flushing at the end of
> the set of transactions.
>

This means that the configured mutation buffer size for the KuduSession object was not large enough to hold all of the operations that you applied before flushing. The default is 7MB, but it could safely be configured to be a bit larger at the expense of memory.

> We’re using the kudu-python api and would like to better understand the
> behavior of the different flushing modes… (assuming
> SessionConfiguration.FlushMode is the thing we should be looking at).
>

Since the Python API wraps the C++ API, it's best to look at the C++ client docs. See https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduSession.html#aaec3956e642610d703f3b83b78e24e19 for docs on the various flush modes.

> Are there any global settings to tweak to allow a larger buffer? What
> would be the pro’s and con’s of this?
>

Past a certain size you will exceed the maximum RPC size, and your writes will then fail. Additionally, flushing a larger buffer at a time implies higher latency for that flush (since it's doing more work).

> Would explicitly using KuduSession.setFlushMode(AUTO_FLUSH_SYNC) make any
> difference?
>

AUTO_FLUSH_SYNC means that each operation that you Apply (e.g. an insert or update) makes its own separate round trip to the appropriate server before responding. This will be very slow if your goal is to stream a high volume of writes into Kudu. It is most appropriate for an online application where you might want to do only a few inserts in response to some web request, etc.
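If you do stay with manual flushing, one way to keep the session under the buffer limit is to flush every N operations instead of once at the end of the whole set. Below is a rough sketch of that idea, not a tested recipe: it assumes only that the session object exposes apply() and flush(). The names apply_in_batches, batch_size, and FakeSession are hypothetical and used purely for illustration, and a sensible batch size depends on how large your rows are.

```python
def apply_in_batches(session, operations, batch_size=1000):
    """Apply operations to a session, flushing every `batch_size` operations.

    `session` is assumed to provide apply(op) and flush(); nothing else
    about the client API is assumed. Returns the number of flushes issued.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    flushes = 0
    pending = 0
    for op in operations:
        session.apply(op)
        pending += 1
        # Flush before the buffer can grow past batch_size operations.
        if pending >= batch_size:
            session.flush()
            flushes += 1
            pending = 0

    # Flush whatever is left over from the final partial batch.
    if pending:
        session.flush()
        flushes += 1
    return flushes


class FakeSession:
    """Hypothetical stand-in for a real session, for illustration only.

    It counts buffered operations instead of sending them anywhere.
    """

    def __init__(self):
        self.buffered = 0
        self.max_buffered = 0
        self.flushed = 0

    def apply(self, op):
        self.buffered += 1
        self.max_buffered = max(self.max_buffered, self.buffered)

    def flush(self):
        self.flushed += self.buffered
        self.buffered = 0


if __name__ == "__main__":
    fake = FakeSession()
    count = apply_in_batches(fake, range(2500), batch_size=1000)
    print(count, fake.flushed, fake.max_buffered)
```

With a real session you would pass the operations you build from your table in place of range(2500). The point of the sketch is only that the number of buffered operations never exceeds batch_size between flushes.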
AUTO_FLUSH_BACKGROUND is typically the best choice for a streaming ingest or bulk load scenario since it aims to manage buffer sizes for you automatically for best performance. We'll continue to invest in making AUTO_FLUSH_BACKGROUND work as well as possible for these scenarios.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
