In case curious/for posterity: so far I'm doing the ingest into PostgreSQL outside of Beam entirely, using https://github.com/uktrade/pg-bulk-ingest. It handles automatic migration of the table, ingests data in batches where each batch is a single transaction, and stores a high watermark against the table that's read on subsequent ingests.
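For reference, a minimal sketch of what that ingest code looks like. The table, columns and connection string are placeholders, and this is from my reading of the pg-bulk-ingest README - check there for the authoritative API:

    import sqlalchemy as sa
    from pg_bulk_ingest import HighWatermark, Upsert, Delete, ingest

    metadata = sa.MetaData()
    my_table = sa.Table(
        "my_table", metadata,
        sa.Column("id", sa.INTEGER, primary_key=True),
        sa.Column("value", sa.VARCHAR(16)),
        schema="my_schema",
    )

    # Called with the stored high watermark; yields
    # (new_watermark, batch_metadata, rows) tuples, where each batch
    # is ingested in its own transaction
    def batches(high_watermark):
        if high_watermark is None or high_watermark < "2023-05-06":
            yield "2023-05-06", None, (
                (my_table, (1, "a")),
                (my_table, (2, "b")),
            )

    engine = sa.create_engine("postgresql+psycopg://postgres@127.0.0.1:5432/")
    with engine.connect() as conn:
        ingest(
            conn, metadata, batches,
            high_watermark=HighWatermark.LATEST,  # resume from stored watermark
            upsert=Upsert.IF_PRIMARY_KEY,
            delete=Delete.OFF,
        )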
Whether or not this can somehow be integrated with Beam, I'm not too sure.

On Sat, May 6, 2023 at 4:09 PM Michal Charemza <mic...@charemza.name> wrote:

> Ah, thanks for your reply!
>
> So done manually/outside - I was hoping to have this in the pipeline
> really, so that DDLs would be done in a transaction along with the data
> ingest. And even if not quite that, it would be great to have all the
> stages/code defined and run by the same thing, and not have something
> running outside the pipeline.
>
> The automatic migration thing... will have a nose around. That's close to
> what I'm looking for, although maybe not quite. I think what I would like
> is some sort of sink transform that accepts as a parameter a SQLAlchemy
> table definition, and automatically migrates any existing table to that
> (where it can). Although I don't know Beam well enough yet to know if
> that's really what I want...
>
> (Note also - the word "application" here... there isn't really an
> application here - I'm using PostgreSQL as, I guess, a data warehouse.)
>
> Michal
>
> On Sat, May 6, 2023 at 3:18 PM Pavel Solomin <p.o.solo...@gmail.com>
> wrote:
>
>> Hello!
>>
>> Usually DDLs (create table / alter table) live outside of the
>> applications. In my experience, that sort of task is done either
>> manually or via automations like Liquibase / Flyway. This is not
>> specific to Beam; it is a common pattern in backend / data engineering
>> app development.
>>
>> Some applications may support the simplest, conflict-free DDLs, like
>> adding a nullable column, without restarting the app itself.
>>
>> I remember seeing some examples in Java and Python of Beam apps that
>> supported automatic schema migrations. Example:
>>
>> https://medium.com/inside-league/streaming-data-to-bigquery-with-dataflow-and-real-time-schema-updating-c7a3deba3bad
>>
>> I am not aware of automatic solutions for arbitrary schema changes,
>> though.
>>
>> On Saturday, 6 May 2023, Michal Charemza <mic...@charemza.name> wrote:
>> > I'm looking into using Beam to ingest from various sources into a
>> > PostgreSQL database, but there is something that I don't quite know
>> > how to fit into the Beam model: how to deal with "non-data" tasks
>> > that need to happen before or after the pipeline proper.
>> >
>> > For example, creation of tables, renames of tables, migrations on
>> > existing tables. Where should all this sort of code/logic live if the
>> > fetch/ingestion of data is via Beam? Or is this entirely outside the
>> > Beam model - it should happen before the pipeline, or after the
>> > pipeline, but not as part of the pipeline?
>>
>> --
>> Best Regards,
>> Pavel Solomin
>>
>> Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
>> <https://www.linkedin.com/in/pavelsolomin>
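P.S. In case it helps anyone finding this thread later: one way to keep the DDL and the pipeline defined and run by the same driver script, while the DDL still runs outside the pipeline itself (as Pavel suggests), is something like the below. This is a hypothetical sketch, not an official Beam pattern - the table and connection string are placeholders:

    import apache_beam as beam
    import sqlalchemy as sa

    metadata = sa.MetaData()
    events = sa.Table(
        "events", metadata,
        sa.Column("id", sa.INTEGER, primary_key=True),
        sa.Column("payload", sa.VARCHAR),
    )

    engine = sa.create_engine("postgresql+psycopg://postgres@127.0.0.1:5432/")

    # 1. DDL first, in its own transaction, before any Beam code runs
    with engine.begin() as conn:
        metadata.create_all(conn)

    # 2. Then the pipeline proper
    with beam.Pipeline() as p:
        (
            p
            | beam.Create([(1, "a"), (2, "b")])
            | beam.Map(print)  # stand-in for a real PostgreSQL sink
        )

The DDL and the ingest still aren't in one transaction, but at least all the stages live in one place and run together.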