Beam pipeline should fail when one FILE_LOAD fails for BigQueryIO on Flink

Kaymak, Tobias Wed, 13 Feb 2019 06:28:27 -0800

Hello,

I have a BigQuery table which ingests a Protobuf stream from Kafka with a
Beam pipeline. The Protobuf has a `log Map<String,String>` column which
translates to a field "log" of type RECORD with unknown fields in BigQuery.


So I scanned my whole stream to know which schema fields to expect and
created an empty daily-partitioned table in BigQuery with correct fields
for this RECORD type.

Now someone is pushing a new field into this RECORD and the import fails as
the schema is not matching anymore. My pipeline is configured to use
FILE_LOADS and so these do fail - but the Flink pipeline just continues and
does not throw any error.

My Beam version is 2.10-SNAPSHOT running on Flink 1.6, my two questions are:

1. How can I make the pipeline fail hard in this situation?
2. How can I prevent this from happening? If I create the table with the
correct schema in the beginning the table schema seems to be overwritten
and autodetected again (without the added field). This is causing the load
to fail. If I alter the schema and add the field manually when the pipeline
is running it fixes it, but then I already have a dozen of failed imports.
The main issue seems to be that I have a Protobuf schema which only defines
a Map<String,String> type, that is not specific enough to create the full
matching schema from it.

Beam pipeline should fail when one FILE_LOAD fails for BigQueryIO on Flink

Reply via email to