That's great, Reza, thanks! I'm still getting to grips with Beam and Dataflow, so apologies for all the questions. I have a few more if that's OK:
1. When the article says "the schema would be mutated", does this mean the BigQuery schema?

2. Also, when the known good BigQuery schema is retrieved, and if it is the BigQuery schema being updated in the question above, is this done with the BigQuery API rather than BigQueryIO? In other words, what is the process behind the "validate and mutate BQ schema" step in the image?

Thanks,
Joe

On 30 Nov 2018 16:48, "Reza Rokni" <[email protected]> wrote:

Hi Joe,

That part of the blog should have been written a bit more clearly... I blame the writer ;-)

So while that solution worked, it was inefficient, which is discussed in the next paragraph. Essentially, checking the validity of the schema every time is not efficient, especially as the schemas are normally fine. So the next paragraph was:

*However, this design could not make use of the inbuilt efficiencies that BigQueryIO provided, and also burdened us with technical debt. Chris then tried various other tactics to beat the boss. In his words ... "The first attempt at fixing this inefficiency was to remove the costly JSON schema detection 'DoFn' which every metric goes through, and move it to a 'failed inserts' section of the pipeline, which is only run when there are errors on inserting into BigQuery."*

Cheers
Reza

On Fri, 30 Nov 2018 at 09:01, Joe Cullen <[email protected]> wrote:

> Thanks Reza, that's really helpful!
>
> I have a few questions:
>
> "He used a GroupByKey function on the JSON type and then a manual check on
> the JSON schema against the known good BigQuery schema. If there was a
> difference, the schema would mutate and the updates would be pushed
> through."
>
> If the difference was that a new column had been added to the JSON elements,
> does there need to be any mutation? The JSON schema derived from the JSON
> elements would already have this new column, and if BigQuery allows for
> additive schema changes then this new JSON schema should be fine, right?
>
> But then I'm not sure how you would enter the 'failed inserts' section of
> the pipeline (as the insert should have been successful).
>
> Have I misunderstood what is being mutated?
>
> Thanks,
> Joe
>
> On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <[email protected]> wrote:
>
>> Hi Joe,
>>
>> You may find some of the info in this blog of interest; it's based on
>> streaming pipelines but has useful ideas:
>>
>> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>>
>> Cheers
>>
>> Reza
>>
>> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have a pipeline reading CSV files, performing some transforms, and
>>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>>> separate JSON file. If the CSV files had a new column added (and I wanted
>>> to include this column in the resultant BigQuery table), I'd have to change
>>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>>> schema using BigQueryIO? How do people normally deal with potential changes
>>> to input CSVs?
>>>
>>> Thanks,
>>> Joe
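For readers following the thread, here is a minimal sketch (Beam Java SDK) of the "failed inserts" pattern Reza quotes from the blog: BigQueryIO performs the streaming inserts as normal, and only rows that BigQuery rejects are routed to a branch that validates and mutates the schema. This is not the Square Enix code; the table spec and the `MutateSchemaAndReinsertFn` DoFn are hypothetical placeholders.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class FailedInsertBranch {

  /** Writes metric rows and routes only the rejected ones to the schema-mutation branch. */
  public static void writeWithFailedInsertHandling(PCollection<TableRow> rows) {
    WriteResult result =
        rows.apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.metrics")  // hypothetical table spec
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                // Non-transient errors (e.g. an unknown column) are not retried,
                // so they surface in getFailedInserts() below.
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    // Only rows BigQuery rejected reach this branch, so the costly schema
    // check no longer runs on every element of the pipeline.
    result.getFailedInserts()
        .apply("ValidateAndMutateBQSchema", ParDo.of(new MutateSchemaAndReinsertFn()));
  }

  /** Hypothetical DoFn: patch the table schema (see the next sketch) and re-insert the row. */
  static class MutateSchemaAndReinsertFn extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow failed = c.element();
      // e.g. SchemaMutator.addMissingColumns(failed, "my_dataset", "metrics");
      // then re-insert the row (implementation omitted).
      c.output(failed);
    }
  }
}
```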
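On question 2 specifically: one way to implement a "validate and mutate BQ schema" step is with the BigQuery client library (the BigQuery API) rather than BigQueryIO; whether that is exactly what the blog's pipeline does is a question for Reza. The sketch below is an assumption about how such a step could look: it adds any unknown columns from a failed row as NULLABLE STRING fields, which is an additive change BigQuery permits. Real code would infer proper field types and handle nested records.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.List;

public class SchemaMutator {

  /**
   * Compares a failed row against the live table schema and appends any missing
   * columns as NULLABLE STRING fields. STRING-only typing is a simplifying
   * assumption for this sketch.
   */
  public static void addMissingColumns(TableRow failedRow, String dataset, String tableName) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(TableId.of(dataset, tableName));
    Schema schema = table.getDefinition().getSchema();

    List<Field> fields = new ArrayList<>(schema.getFields());
    boolean changed = false;
    for (String column : failedRow.keySet()) {
      boolean known = fields.stream().anyMatch(f -> f.getName().equals(column));
      if (!known) {
        // Additive change only: BigQuery allows appending NULLABLE columns.
        fields.add(Field.newBuilder(column, LegacySQLTypeName.STRING)
            .setMode(Field.Mode.NULLABLE)
            .build());
        changed = true;
      }
    }

    if (changed) {
      Schema newSchema = Schema.of(fields);
      table.toBuilder()
          .setDefinition(StandardTableDefinition.of(newSchema))
          .build()
          .update();
    }
  }
}
```

Once the table has been patched this way, the failed rows can be re-inserted and should succeed, which is why the expensive schema work only runs on the error path.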
