We’re currently developing a streaming Dataflow pipeline using the latest 
version of the Python Beam SDK. 

The pipeline performs a number of transformations/aggregations before writing 
to BigQuery. We're peaking at ~250 elements/sec going into the WriteToBigQuery 
step; however, we're seeing very poor performance in the pipeline, needing to 
scale to a considerable number of workers, and often seeing the entire pipeline 
'freeze', with throughput dropping to zero at all stages for ~30 min periods. 

The number of unacked messages keeps growing (so it looks like the pipeline 
can never catch up). The wall time on the WriteToBigQuery steps is considerably 
higher than on the rest of the stages in the pipeline.

If we run another version of the Dataflow job with the WriteToBigQuery step 
removed, performance is *dramatically* improved: system lag is minimal, and 
approximately one-third of the vCPUs are required to keep on top of the 
incoming messages.

Are there known limitations with WriteToBigQuery in the Python SDK? We have had 
our quota raised by Google, so limits on streaming inserts shouldn’t be an 
issue.
