Hi,

AFAIK there are two types of batching/sharding for BigQuery streaming
inserts: 1) a fixed number of shard keys via the `--numStreamingKeys` pipeline
option, and 2) automatic sharding via `withAutoSharding()`.

Instead, I'd like to do my own batching and provide my own GroupIntoBatches
implementation. More precisely, I'd like to batch rows by overall byte size
rather than by number of rows. The reason is that some individual rows can be
very large, and I believe a batch of such rows could push a streaming insert
request's payload over the BigQuery API's size limit and get the request
rejected.
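
To make that concrete, the sketch below is roughly what I'm after, whether
with my own transform or with `GroupIntoBatches.ofByteSize`, which I believe
recent SDK versions provide. The key, size limit, and per-row size estimate
are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// rows is my PCollection<TableRow> of individual rows.
// Key the rows (a single placeholder key here) and group them so each batch
// stays under a target payload size, regardless of how many rows it holds.
PCollection<KV<String, Iterable<TableRow>>> batched =
    rows
        .apply(WithKeys.of("shard-0"))  // placeholder keying strategy
        .apply(GroupIntoBatches.<String, TableRow>ofByteSize(
            5_000_000L,                               // ~5 MB target per batch (illustrative)
            row -> (long) row.toString().length()));  // crude per-row size estimate
```

(In a streaming pipeline I'd presumably also set `withMaxBufferingDuration` so
small batches still flush.)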

However, from looking at the Java SDK internals, I'm not sure that's
possible, as `BigQueryIO.write()` appears to accept only individual rows as
`PCollection<TableRow>`. Ideally I'd like to provide pre-batched rows in the
form of `Iterable<TableRow>` instead.
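
As far as I can tell, the only way to connect such batches to the sink today
is to flatten them back into individual rows before the write, which defeats
the point of controlling the batches myself. Continuing the sketch above, with
a placeholder table spec:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.Values;

// Un-batch and hand individual rows to the sink, since that's all it accepts.
batched
    .apply(Values.create())       // PCollection<Iterable<TableRow>>
    .apply(Flatten.iterables())   // back to PCollection<TableRow>
    .apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder; table assumed to exist
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
```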

From looking at the Python SDK, it seems this might be possible by setting
`BigQueryWriteFn.with_batched_input=True`.

Is what I'm trying to achieve possible with the Java SDK?

Thanks!

Julien
