We’re currently developing a streaming Dataflow pipeline using the latest version of the Python Beam SDK.
The pipeline performs a number of transformations/aggregations before writing to BigQuery. We peak at ~250 elements/sec going into the WriteToBigQuery step, yet we're seeing very poor performance: the pipeline needs to scale to a considerable number of workers, and the entire pipeline often 'freezes', with throughput dropping to zero at all stages for periods of ~30 minutes. The number of unacked messages keeps growing, so it looks like the pipeline can never catch up. The wall time on the WriteToBigQuery steps is considerably higher than for the rest of the stages in the pipeline.

If we run another version of the Dataflow job with the WriteToBigQuery step removed, performance is *dramatically* improved: system lag is minimal, and approximately one third of the vCPUs are enough to keep on top of the incoming messages.

Are there known limitations with WriteToBigQuery in the Python SDK? We have had our quota raised by Google, so limits on streaming inserts shouldn't be an issue.
