Hi!

I have set up a Beam pipeline to read from a PostgreSQL server and write to
a BigQuery table, and it takes forever even for as few as 20k records. While
investigating the process, I found that the pipeline creates one file per
record and then requests a new load job for each file(!), which makes the
pipeline unscalable.
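
For context, the pipeline is roughly the following (driver, connection
details and schema are simplified placeholders, but the JdbcIO -> BigQueryIO
shape is exactly what I run):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

Pipeline p = Pipeline.create(options);

p.apply("ReadFromPostgres",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver",
                        "jdbc:postgresql://host:5432/mydb")  // placeholder connection
                    .withUsername("user")
                    .withPassword("pass"))
            .withQuery("SELECT id, payload FROM source_table")  // placeholder query
            .withRowMapper(rs -> new TableRow()
                .set("id", rs.getLong(1))
                .set("payload", rs.getString(2)))
            .withCoder(TableRowJsonCoder.of()))
 .apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.target_table")  // placeholder table
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

p.run().waitUntilFinish();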

Searching the Internet, I found a very similar question on SO, but without
any answer:
https://stackoverflow.com/questions/49276833/slow-bigquery-load-job-when-data-comes-from-apache-beam-jdbcio
The other user experimented with replacing JdbcIO with TextIO and couldn't
replicate the behavior, so it seems the problem is the combination of
JdbcIO + BigQueryIO, and maybe Postgres plays a part as well.

I then tried to debug the execution manually using the direct runner and a
local debugger, and found that the problem is with bundling: the runner
creates bundles of size 1. I suppose this has to do with the
"dependent parallelism between transforms" described in the execution model:
https://beam.apache.org/documentation/runtime/model/#dependent-parallellism.

I have tried several things to work around this, with no luck, including:
 * Adding an intermediate mapping step
 * Removing the dynamic destination table
 * Adding a shuffle operation using the Reparallelize<> class from the GCP
Jdbc -> BigQuery example template (see the sketch after this list):
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/5f245fb1e2223ea42d5093863816a41391a548dd/src/main/java/com/google/cloud/teleport/io/DynamicJdbcIO.java#L405
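
For reference, the shuffle workaround looked roughly like this. I am using
Reshuffle.viaRandomKey(), the stock Beam transform that the template's
Reparallelize<> is modeled on; placeholders are the same as in the sketch
above:

import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Same JdbcIO read as in the sketch above, bound to a variable.
PCollection<TableRow> rows =
    p.apply("ReadFromPostgres", JdbcIO.<TableRow>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb"))
        .withQuery("SELECT id, payload FROM source_table")
        .withRowMapper(rs -> new TableRow()
            .set("id", rs.getLong(1))
            .set("payload", rs.getString(2)))
        .withCoder(TableRowJsonCoder.of()));

// Force a materialization point between the read and the write, hoping the
// runner would re-bundle the elements downstream of the shuffle.
rows.apply("BreakFusion", Reshuffle.<TableRow>viaRandomKey())
    .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.target_table")
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));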

Is this supposed to happen? It looks like a bug to me that the default
behavior produces bundles of size 1. I also feel that BigQueryIO should
batch its output into larger load jobs for efficiency.

Is there a way to override the bundling somehow?
Any other suggestions for working around this?


Cheers!
Konstantinos
