Thanks Radek! Very delayed response here... but we've solved the problem now. We finally just went through a somewhat painful migration to a new dataset where the tables were partitioned correctly, flipped the flags on all our jobs at the same time, and so on. A bit painful, but I think less risky than trying to switch to the Storage Write API, with whatever differences might have cropped up from that. We could have dealt with the truncation by having two separate write streams, but this way we didn't have to.
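In case it's useful to someone else who hits the same wall, here is a minimal sketch of what that alternative could have looked like: an explicit DML truncate followed by an append-only write using the Storage Write API method. This is illustrative only - the project, dataset, table, schema and input data below are placeholders, not our actual pipeline - and it assumes a batch job writing to a pre-created table.

    # Sketch only: placeholder names, not our real pipeline.
    import apache_beam as beam
    from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
    from google.cloud import bigquery

    PROJECT = "my-project"   # placeholder
    DATASET = "my_dataset"   # placeholder
    TABLE = "my_table"       # placeholder

    # Step 1: truncate the destination ourselves, since the Storage Write API
    # path is append-only and WRITE_TRUNCATE isn't available there.
    bigquery.Client(project=PROJECT).query(
        f"TRUNCATE TABLE `{PROJECT}.{DATASET}.{TABLE}`"
    ).result()

    # Step 2: run the batch pipeline with the Storage Write API method and
    # WRITE_APPEND, so no load-then-copy step (and no partitioning mismatch)
    # is involved.
    with beam.Pipeline() as p:
        rows = p | "Create" >> beam.Create([{"id": 1, "value": "a"}])  # placeholder data
        rows | "Write to BigQuery" >> WriteToBigQuery(
            project=PROJECT,
            dataset=DATASET,
            table=TABLE,
            schema="id:INTEGER,value:STRING",  # placeholder schema
            method=WriteToBigQuery.Method.STORAGE_WRITE_API,
            create_disposition=BigQueryDisposition.CREATE_NEVER,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
        )

The ordering is the important part: the truncate runs before the pipeline writes anything, and the write itself only ever appends. Stream counts and quotas for the exactly-once mode are a separate tuning question.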
Thanks!
-Lina

On Wed, Mar 19, 2025 at 10:58 AM Radek Stankiewicz <radosl...@google.com> wrote:

> Hi Lina,
> thanks for confirming that you are using write_truncate - that's a bummer,
> as it's not possible to truncate with storage_write_api; it's append only.
> As it is a batch job, I imagine your pipeline could have a step where you
> truncate the table to overcome this limitation (e.g. via DML).
> Keep in mind that the Storage Write API has two methods - at-least-once and
> exactly-once. For exactly-once I would recommend controlling the number of
> streams - withNumStorageWriteApiStreams - as you may deplete the quota
> quickly.
> The Storage Write API uses a cross-language transform - for Dataflow it
> will work right away - let me know how you run your pipeline.
>
> Radek
>
> On Wed, Mar 19, 2025 at 6:12 PM Lina Mårtensson <lina@camus.energy> wrote:
>
>> Thanks Radek!
>>
>> I didn't realize that writing is done with a copy job - then I understand
>> why we need to configure partitioning as well. And that all makes sense.
>>
>> We haven't tried the Storage Write API - that wasn't available for Python
>> yet when we started doing this. I will take a look at it, see if it might
>> work for us, and get back to you. Would using the Storage Write API allow
>> us to write to a partitioned table without specifying so?
>> I should mention as well that streaming the data would be out of the
>> question, since we need to use mutating DML statements as well.
>>
>> Our call looks like this:
>>
>> "Write to BigQuery" >> WriteToBigQuery(
>>     project=self.project_id,
>>     table=self.get_table_name,
>>     dataset=self.dataset,
>>     schema=self.get_schema,
>>     schema_side_inputs=(beam.pvalue.AsDict(types_for_schema),),
>>     create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
>>     write_disposition=(
>>         BigQueryDisposition.WRITE_TRUNCATE
>>         if self.overwrite
>>         else BigQueryDisposition.WRITE_APPEND
>>     ),
>>     additional_bq_parameters={
>>         "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"],
>>     },
>> )
>>
>> On Wed, Mar 19, 2025 at 3:12 AM Radek Stankiewicz <radosl...@google.com> wrote:
>>
>>> Hi Lina,
>>> there are multiple reasons why a copy job is used with a temporary table:
>>> - you may be using dynamic destinations
>>> - you are loading lots of data, probably with truncate
>>> This way we ensure atomicity, as we can trigger copies from multiple temp
>>> tables into one final table.
>>> Can you confirm, or paste a snippet of, how you configured
>>> apache_beam.io.gcp.bigquery.WriteToBigQuery?
>>>
>>> BigQuery doesn't allow copying non-partitioned tables into partitioned
>>> tables - it's a BQ limitation.
>>>
>>> Have you tried other loading methods, e.g. the Storage Write API?
>>>
>>> Radek
>>>
>>> On Wed, Mar 19, 2025 at 5:42 AM Lina Mårtensson via user <user@beam.apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have, by now, a large set of different Beam jobs, all written in
>>>> Python, that write to a set of BigQuery tables in a single dataset, all
>>>> of which behave more or less the same way. These tables aren't
>>>> partitioned at all, but going forward, we need them to be.
>>>>
>>>> I partitioned a single table to start with, and was very surprised to
>>>> find that a Beam job that wrote to it couldn't do so:
>>>>
>>>> Failed to copy Non partitioned table to Column partitioned table: not
>>>> supported.
>>>>
>>>> We have a bunch of pre-created tables, and I would've thought I could
>>>> just keep writing to those without changing settings even after setting
>>>> up partitioning on them.
>>>> It doesn't seem to matter whether the create_disposition is
>>>> CREATE_IF_NEEDED or CREATE_NEVER.
>>>> It does work when I set additional_bq_parameters to add
>>>> time_partitioning, but it would be a huge undertaking not only to update
>>>> all of our currently running jobs across many projects, but also to
>>>> synchronize those changes with updates to the underlying BigQuery
>>>> tables. And it doesn't seem like it should be necessary to specify this
>>>> if we're not creating any new tables?
>>>>
>>>> Is there any way to just write the data we have to pre-created,
>>>> partitioned tables without having to set time_partitioning in
>>>> additional_bq_parameters, or, failing that, is there some other
>>>> recommended way to solve this problem?
>>>>
>>>> Thanks,
>>>> -Lina
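P.S. For anyone who finds this thread later: the additional_bq_parameters change that did make writes to an already-partitioned table work looked roughly like this. The partitioning type and field are placeholders - they have to mirror how the destination table is actually partitioned - so treat this as a sketch rather than a recipe.

    # Sketch only: the timePartitioning spec must match the real table.
    from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

    write = WriteToBigQuery(
        project="my-project",    # placeholder
        dataset="my_dataset",    # placeholder
        table="my_table",        # placeholder
        schema="id:INTEGER,ts:TIMESTAMP,value:STRING",  # placeholder
        create_disposition=BigQueryDisposition.CREATE_NEVER,
        write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
        additional_bq_parameters={
            "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"],
            # Telling the load/copy jobs about the partitioning is what made
            # the partitioned destination work; "DAY" and "ts" are examples.
            "timePartitioning": {"type": "DAY", "field": "ts"},
        },
    )

We ultimately avoided rolling this out everywhere by migrating to a new dataset instead, as described above.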