Hi,

We have a Dataflow job that loads data from GCS, does a bit of transformation, 
then writes to a number of BigQuery tables using DynamicDestinations.
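
For context, the write looks roughly like this (a simplified sketch - MyRecord,
toTableRow and schemaFor are stand-ins for our actual types and helpers, and
the project/dataset names are placeholders):

    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    records.apply("WriteTOBigQuery",
        BigQueryIO.<MyRecord>write()
            .to(new DynamicDestinations<MyRecord, String>() {
              @Override
              public String getDestination(ValueInSingleWindow<MyRecord> element) {
                // Route each record to a table derived from its contents.
                return element.getValue().getTableName();
              }

              @Override
              public TableDestination getTable(String tableName) {
                // Project and dataset are placeholders here.
                return new TableDestination(
                    "our-project:our_dataset." + tableName, null);
              }

              @Override
              public TableSchema getSchema(String tableName) {
                return schemaFor(tableName);
              }
            })
            .withFormatFunction(record -> toTableRow(record))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));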

The same job runs fine on smaller data sets (~70 million records), but this one
is struggling when processing ~500 million records. Both jobs write to the
same number of tables - the only difference is the number of records.

Example job IDs include 2018-03-02_04_29_44-2181786949469858712 and
2018-03-02_08_46_28-4580218739500768796. They use BigQueryIO to write to
BigQuery with the BigQueryIO.Write.Method.FILE_LOADS method (the default for
a bounded job). They successfully stage all their data to GCS, but then for
some reason scale down to a single worker when processing the step
WriteTOBigQuery/BatchLoads/ReifyResults and stay in that step for hours.
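
For reference, the write method can also be pinned explicitly - the following,
appended to the write sketched above, should be equivalent to the default for
a bounded PCollection (withCustomGcsTempLocation may not exist in older SDKs,
and the bucket name is a placeholder):

        // Make the load-job method explicit and control where load files
        // are staged before the BigQuery load jobs run.
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withCustomGcsTempLocation(
            org.apache.beam.sdk.options.ValueProvider.StaticValueProvider.of(
                "gs://our-temp-bucket/bq-loads"))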

In the logs we see many entries like this:

Proposing dynamic split of work unit 
...-7e07;2018-03-02_04_29_44-2181786949469858712;662185752552586455 at 
{"fractionConsumed":0.5}
Rejecting split request because custom reader returned null residual source.

And also occasionally this:

Processing lull for PT24900.038S in state process of 
WriteTOBigQuery/BatchLoads/ReifyResults/ParDo(Anonymous) at 
java.net.SocketInputStream.socketRead0(Native Method) at 
java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at 
java.net.SocketInputStream.read(SocketInputStream.java:170) at 
java.net.SocketInputStream.read(SocketInputStream.java:141) ...

The job does seem to make progress eventually, but only after many hours. It
then fails later with this error, which may or may not be related (we're just
starting to look into it):

(94794e1a2c96f380): java.lang.RuntimeException: 
org.apache.beam.sdk.util.UserCodeException: java.io.IOException: Unable to 
patch table description: {datasetId=..., projectId=..., 
tableId=9c20908cc6e549b4a1e116af54bb8128_011249028ddcc5204885bff04ce2a725_00001_00000},
 aborting after 9 retries.

We're not sure how to proceed, so any pointers would be appreciated.

Thanks,
Andrew
