That's great, Reza, thanks! I'm still getting to grips with Beam and Dataflow, so apologies for all the questions. I have a few more if that's OK:
1. When the article says "the schema would be mutated", does this mean the BigQuery schema?

2. Also, when the known good BigQuery schema is retrieved, and if it is the BigQuery schema being updated in the question above, is this done with the BigQuery API rather than BigQueryIO? In other words, what is the process behind the "validate and mutate BQ schema" step in the image?

Thanks,
Joe

On 30 Nov 2018 16:48, "Reza Rokni" <[email protected]> wrote:

Hi Joe,

That part of the blog should have been written a bit more clearly... I blame the writer ;-)

So while that solution worked, it was inefficient, which is discussed in the next paragraph. Essentially, checking the validity of the schema every time is not efficient, especially as the schemas are normally fine. So the next paragraph was:

*However, this design could not make use of the inbuilt efficiencies that BigQueryIO provided, and also burdened us with technical debt. Chris then tried various other tactics to beat the boss. In his words ... "The first attempt at fixing this inefficiency was to remove the costly JSON schema detection 'DoFn' which every metric goes through, and move it to a 'failed inserts' section of the pipeline, which is only run when there are errors on inserting into BigQuery."*

Cheers
Reza

On Fri, 30 Nov 2018 at 09:01, Joe Cullen <[email protected]> wrote:

> Thanks Reza, that's really helpful!
>
> I have a few questions:
>
> "He used a GroupByKey function on the JSON type and then a manual check on
> the JSON schema against the known good BigQuery schema. If there was a
> difference, the schema would mutate and the updates would be pushed
> through."
>
> If the difference was that a new column had been added to the JSON elements,
> does there need to be any mutation? The JSON schema derived from the JSON
> elements would already have this new column, and if BigQuery allows for
> additive schema changes then this new JSON schema should be fine, right?
>
> But then I'm not sure how you would enter the 'failed inserts' section of
> the pipeline (as the insert should have been successful).
>
> Have I misunderstood what is being mutated?
>
> Thanks,
> Joe
>
> On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <[email protected]> wrote:
>
>> Hi Joe,
>>
>> You may find some of the info in this blog of interest; it's based on
>> streaming pipelines but has useful ideas:
>>
>> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>>
>> Cheers
>>
>> Reza
>>
>> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have a pipeline reading CSV files, performing some transforms, and
>>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>>> separate JSON file. If the CSV files had a new column added (and I wanted
>>> to include this column in the resultant BigQuery table), I'd have to change
>>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>>> schema using BigQueryIO? How do people normally deal with potential changes
>>> to input CSVs?
>>>
>>> Thanks,
>>> Joe
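For readers following the thread, here is a minimal sketch (Beam Java SDK) of the "failed inserts" pattern Reza quotes from the blog: BigQueryIO performs the streaming inserts as normal, and only rows that BigQuery rejects are routed to a branch that validates and mutates the schema. This is not the Square Enix code; the table spec and the `MutateSchemaAndReinsertFn` DoFn are hypothetical placeholders.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class FailedInsertBranch {

  /** Writes metric rows and routes only the rejected ones to the schema-mutation branch. */
  public static void writeWithFailedInsertHandling(PCollection<TableRow> rows) {
    WriteResult result =
        rows.apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.metrics")  // hypothetical table spec
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                // Non-transient errors (e.g. an unknown column) are not retried,
                // so they surface in getFailedInserts() below.
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    // Only rows BigQuery rejected reach this branch, so the costly schema
    // check no longer runs on every element of the pipeline.
    result.getFailedInserts()
        .apply("ValidateAndMutateBQSchema", ParDo.of(new MutateSchemaAndReinsertFn()));
  }

  /** Hypothetical DoFn: patch the table schema (see the next sketch) and re-insert the row. */
  static class MutateSchemaAndReinsertFn extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow failed = c.element();
      // e.g. SchemaMutator.addMissingColumns(failed, "my_dataset", "metrics");
      // then re-insert the row (implementation omitted).
      c.output(failed);
    }
  }
}
```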
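On question 2 specifically: one way to implement a "validate and mutate BQ schema" step is with the BigQuery client library (the BigQuery API) rather than BigQueryIO; whether that is exactly what the blog's pipeline does is a question for Reza. The sketch below is an assumption about how such a step could look: it adds any unknown columns from a failed row as NULLABLE STRING fields, which is an additive change BigQuery permits. Real code would infer proper field types and handle nested records.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.List;

public class SchemaMutator {

  /**
   * Compares a failed row against the live table schema and appends any missing
   * columns as NULLABLE STRING fields. STRING-only typing is a simplifying
   * assumption for this sketch.
   */
  public static void addMissingColumns(TableRow failedRow, String dataset, String tableName) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(TableId.of(dataset, tableName));
    Schema schema = table.getDefinition().getSchema();

    List<Field> fields = new ArrayList<>(schema.getFields());
    boolean changed = false;
    for (String column : failedRow.keySet()) {
      boolean known = fields.stream().anyMatch(f -> f.getName().equals(column));
      if (!known) {
        // Additive change only: BigQuery allows appending NULLABLE columns.
        fields.add(Field.newBuilder(column, LegacySQLTypeName.STRING)
            .setMode(Field.Mode.NULLABLE)
            .build());
        changed = true;
      }
    }

    if (changed) {
      Schema newSchema = Schema.of(fields);
      table.toBuilder()
          .setDefinition(StandardTableDefinition.of(newSchema))
          .build()
          .update();
    }
  }
}
```

Once the table has been patched this way, the failed rows can be re-inserted and should succeed, which is why the expensive schema work only runs on the error path.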
