That's a valid point. I can live with the current solution for now and hope
that Dataflow will add that support soon.

Another question regarding the JavaBeanSchema implementation: I noticed we are
using clazz.getDeclaredMethods() for preparing the schema. This ignores the
base class's getters; is there a reason we are not using clazz.getMethods()
instead?
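For anyone following along, the difference is easy to demonstrate with plain JDK reflection (the class and getter names below are made up for the example):

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

public class ReflectionDemo {
    public static class Base {
        public String getParentField() { return "parent"; }
    }

    public static class Derived extends Base {
        public String getChildField() { return "child"; }
    }

    // Getters as getDeclaredMethods() sees them: only methods declared
    // directly on the class, ignoring anything inherited.
    public static Set<String> declaredGetters(Class<?> clazz) {
        Set<String> names = new TreeSet<>();
        for (Method m : clazz.getDeclaredMethods()) {
            if (m.getName().startsWith("get")) names.add(m.getName());
        }
        return names;
    }

    // Getters as getMethods() sees them: all public methods, including
    // inherited ones (Object's getClass() is filtered out for clarity).
    public static Set<String> publicGetters(Class<?> clazz) {
        Set<String> names = new TreeSet<>();
        for (Method m : clazz.getMethods()) {
            if (m.getName().startsWith("get") && m.getDeclaringClass() != Object.class) {
                names.add(m.getName());
            }
        }
        return names;
    }
}
```

On `Derived`, `declaredGetters` sees only `getChildField`, while `publicGetters` also sees the inherited `getParentField`, which is the behavior the question is about.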

On Tue, Jan 11, 2022 at 3:24 PM Reuven Lax <[email protected]> wrote:

> What matters is the physical order of the fields in the schema encoding.
> If you add gaps, you'll be fine initially (since the code I added simply
> sorts by the FieldNumber), however you'll have problems on update.
>
> For example, let's say you have the following
>
>    SchemaFieldNumber("0") String fieldOne;
>    SchemaFieldNumber("50") String fieldTwo;
>
> If you remove the check, things will appear just fine. However if you then
> add a new field number in that gap, you'll have problems:
>
>    SchemaFieldNumber("0") String fieldOne;
>    SchemaFieldNumber("1") String fieldThree;
>    SchemaFieldNumber("50") String fieldTwo;
>
> Now fieldThree will be expected to be the second field encoded, and you
> won't be able to update.
>
> Dataflow handles schema updates by keeping track of the full map of field
> name -> encoding position from the old job and telling the coders of the
> new job to maintain the old encoding positions; this forces all new fields
> to be added to the end of the encoding. I think the long-term solution here
> is for Dataflow to support this on state variables as well.
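An aside to make the hazard above concrete: the following self-contained simulation (plain Java, not Beam code) sorts fields by their annotated number and uses the sorted index as the encoding position, which is the behavior being described:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldOrderDemo {
    // Maps each field name to its encoding position: sort fields by their
    // annotated number, then use the index in the sorted order.
    public static Map<String, Integer> encodingPositions(Map<String, Integer> fieldNumbers) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(fieldNumbers.entrySet());
        sorted.sort(Map.Entry.comparingByValue());
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            positions.put(sorted.get(i).getKey(), i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Old job: numbers 0 and 50, so fieldTwo encodes at position 1.
        Map<String, Integer> oldJob = new LinkedHashMap<>();
        oldJob.put("fieldOne", 0);
        oldJob.put("fieldTwo", 50);
        System.out.println(encodingPositions(oldJob));

        // New job: fieldThree fills the gap, so fieldTwo silently moves to
        // position 2, breaking decoding of state written by the old job.
        Map<String, Integer> newJob = new LinkedHashMap<>(oldJob);
        newJob.put("fieldThree", 1);
        System.out.println(encodingPositions(newJob));
    }
}
```

Running `main` shows `fieldTwo` at position 1 in the old job and position 2 once `fieldThree` fills the gap, which is exactly the update break described above.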
>
> Reuven
>
>
> On Tue, Jan 11, 2022 at 1:55 PM gaurav mishra <
> [email protected]> wrote:
>
>> Hi Reuven,
>> In your patch, I just dropped the check for gaps in the numbers and
>> introduced gaps in the annotated field numbers in my beans. The generated
>> schema seems to be consistent and still works across pipeline updates.
>>
>> Preconditions.checkState(
>>     number == i,
>>     "Expected field number "
>>         + i
>>         + " for field: "
>>         + type.getName()
>>         + " instead got "
>>         + number);
>>
>> Is there any reason why we are checking for gaps there?
>>
>> On Mon, Jan 10, 2022 at 5:37 PM gaurav mishra <
>> [email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Jan 10, 2022 at 11:42 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> Gaurav,
>>>>
>>>> Can you try this out and tell me if it works for you?
>>>> https://github.com/apache/beam/pull/16472
>>>>
>>>> This allows you to annotate all your getters with SchemaFieldNumber, to
>>>> specify the order of fields.
>>>>
>>> Hi Reuven
>>> I was able to build and try out your code, and it seems to be working fine
>>> so far; I will continue to do more testing with this build. But I do see a
>>> small issue with the current implementation: the documentation says no gaps
>>> are allowed in those numbers, which makes it hard to use this on classes
>>> that inherit from another class using this annotation.
>>> In my case I have an Input bean and an Output bean that extends the Input
>>> bean, and I have to configure coders for both. The Input bean today uses
>>> numbers 0-40, and the Output bean uses 41-46. If tomorrow I add a new field
>>> to the Input bean, I will have to use 47 on that field to avoid duplicates
>>> in the Output bean. The only solution to this problem I can think of is to
>>> allow gaps in the numbers.
>>>
>>>
>>>>
>>>> Reuven
>>>>
>>>> On Mon, Jan 10, 2022 at 9:28 AM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> Yes, timers will get carried over.
>>>>>
>>>>> On Mon, Jan 10, 2022 at 9:19 AM gaurav mishra <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> That DelegateCoder seems like a good idea to keep the DoFn code
>>>>>> clean. In the short term I will perhaps use a JSON string as my
>>>>>> intermediary and let Jackson do the serialization and deserialization.
>>>>>> From what I have learned so far, Dataflow does some magic in the
>>>>>> background to re-order fields to make the PCollection schema compatible
>>>>>> with the previous job (if I use SchemaCoder), so I don't have to switch
>>>>>> away from Java beans yet. I can use the DelegateCoder for my StateSpec
>>>>>> and continue using SchemaCoder for PCollections.
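For reference, the shape of that approach: Beam's DelegateCoder.of takes a base coder plus to/from conversion functions, and in practice the to/from functions would call Jackson's ObjectMapper. The sketch below mimics that shape in plain JDK, with a hand-rolled key=value encoding standing in for Jackson, purely to show why a name-keyed intermediary survives field reordering:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class DelegateCoderSketch {
    // Minimal stand-in for DelegateCoder: encode T through an intermediate
    // String representation and decode with the inverse function.
    public static class StringDelegate<T> {
        private final Function<T, String> toFn;
        private final Function<String, T> fromFn;

        public StringDelegate(Function<T, String> toFn, Function<String, T> fromFn) {
            this.toFn = toFn;
            this.fromFn = fromFn;
        }

        public String encode(T value) { return toFn.apply(value); }
        public T decode(String encoded) { return fromFn.apply(encoded); }
    }

    // Hand-rolled "JSON-ish" map encoding standing in for Jackson: a
    // name-keyed format survives field reordering, unlike positional ones.
    public static String toKv(Map<String, String> m) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static Map<String, String> fromKv(String s) {
        Map<String, String> m = new LinkedHashMap<>();
        if (s.isEmpty()) return m;
        for (String pair : s.split(";")) {
            String[] kv = pair.split("=", 2);
            m.put(kv[0], kv[1]);
        }
        return m;
    }
}
```

Because decoding is keyed by field name, two objects whose fields were declared in different orders still round-trip to equal values, which is the property the JSON-intermediary idea relies on.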
>>>>>>
>>>>>> @Reuven
>>>>>> Regarding timers in stateful DoFn, do they get carried over to the
>>>>>> new job in Dataflow?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 10, 2022 at 3:44 AM Pavel Solomin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Gaurav,
>>>>>>>
>>>>>>> It looks like the issues related to state evolution are something
>>>>>>> quite recurrent on this mailing list, and they do not seem to be
>>>>>>> covered in the Beam docs, unfortunately.
>>>>>>>
>>>>>>> I am not familiar with Google Dataflow runner, but the way I
>>>>>>> achieved state evolution on Flink runner was a combo of:
>>>>>>> 1 - version-controlled Avro schemas for generating POJOs (models)
>>>>>>> which were used in stateful Beam DoFns
>>>>>>> 2 - a generic Avro-generated POJO as an intermediary - ( schema:
>>>>>>> string, payload: bytes )
>>>>>>> 3 -
>>>>>>> https://beam.apache.org/releases/javadoc/2.35.0/org/apache/beam/sdk/coders/DelegateCoder.html
>>>>>>> responsible for converting a model into an intermediary (2)
>>>>>>>
>>>>>>> This allowed adding and dropping nullable fields, and making other
>>>>>>> Avro-forward-compatible changes, without losing the state. The order of
>>>>>>> fields also didn't matter in this setting. It also meant double
>>>>>>> serialization and deserialization, unfortunately, but I wasn't able to
>>>>>>> come up with any better solution, given that there seemed to be no
>>>>>>> public info available about the topic, or maybe due to my lack of
>>>>>>> experience.
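To picture the generic intermediary in (2): the state coder only ever sees a (schema, payload) pair, so the model's own schema can evolve underneath it. Here is a plain-JDK sketch of such an envelope (a real setup would carry an Avro writer schema and Avro-encoded bytes rather than these illustrative strings):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EnvelopeSketch {
    // Wire format: length-prefixed schema string, then length-prefixed payload.
    public static byte[] encode(String schema, byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            byte[] schemaBytes = schema.getBytes(StandardCharsets.UTF_8);
            out.writeInt(schemaBytes.length);
            out.write(schemaBytes);
            out.writeInt(payload.length);
            out.write(payload);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decoding recovers the writer's schema alongside the payload, which is
    // what Avro's schema resolution needs to read old records with a newer
    // reader schema (added/dropped nullable fields, reordered fields, etc.).
    public static String[] decode(byte[] envelope) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(envelope));
            byte[] schemaBytes = new byte[in.readInt()];
            in.readFully(schemaBytes);
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return new String[] {
                new String(schemaBytes, StandardCharsets.UTF_8),
                new String(payload, StandardCharsets.UTF_8)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```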
>>>>>>>
>>>>>>> Relevant mail threads:
>>>>>>> https://www.mail-archive.com/[email protected]/msg07249.html
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Pavel Solomin
>>>>>>>
>>>>>>> Tel: +351 962 950 692 | Skype:
>>>>>>> pavel_solomin | Linkedin <https://www.linkedin.com/in/pavelsolomin>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 10 Jan 2022 at 08:43, gaurav mishra <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 10, 2022 at 12:19 AM Reuven Lax <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Unfortunately, as you've discovered, Dataflow's update only pays
>>>>>>>>> attention to actual PCollections, and does not look at DoFn state 
>>>>>>>>> variables
>>>>>>>>> at all (historically, stateful DoFns did not yet exist when Dataflow
>>>>>>>>> implemented its update support).
>>>>>>>>>
>>>>>>>>> You can file a feature request against Google to implement update
>>>>>>>>> on state variables, as this is honestly a major feature gap. In the
>>>>>>>>> meantime there might be some workarounds you can use until this 
>>>>>>>>> support is
>>>>>>>>> added. e.g.
>>>>>>>>>    - you could convert the data into a protocol buffer before
>>>>>>>>> storing into state; ProtoCoder should handle new fields, and fields 
>>>>>>>>> in a
>>>>>>>>> protocol buffer have a deterministic order.
>>>>>>>>>
>>>>>>>> Is there an easy way to go from a Java object to a protocol buffer
>>>>>>>> object? The only way I can think of is to first serialize the Java
>>>>>>>> model to JSON and then deserialize the JSON into a protobuf object,
>>>>>>>> and the same for the other way around. While I was writing this, I
>>>>>>>> realized I can maybe store the JSON dump of my Java object in the
>>>>>>>> state spec. That should also work.
>>>>>>>>
>>>>>>>>>    - you could explicitly use the Row object, although this will
>>>>>>>>> again make your code more complex.
>>>>>>>>>
>>>>>>>>
>>>>>>>>> Another intermediate option - we could fairly easily add
>>>>>>>>> a @SchemaFieldId annotation to Beam that would allow you to specify 
>>>>>>>>> the
>>>>>>>>> order of fields. You would have to annotate every getter, something 
>>>>>>>>> like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> @SchemaFieldId(0)
>>>>>>>>> public T getV1();
>>>>>>>>>
>>>>>>>>> @SchemaFieldId(1)
>>>>>>>>> public T getV2();
>>>>>>>>>
>>>>>>>> Yes please. If this is an easy change and If you can point me in
>>>>>>>> the right direction I will be happy to submit a patch for this.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> etc.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 10, 2022 at 12:05 AM Reuven Lax <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> SerializableCoder can be tricky for updates, and can also be very
>>>>>>>>>> inefficient.
>>>>>>>>>>
>>>>>>>>>> As you noticed, when inferring schemas from a Java class, the
>>>>>>>>>> order of fields is non-deterministic (not much that we can do about
>>>>>>>>>> this,
>>>>>>>>>> as Java doesn't return the fields in a deterministic order). For 
>>>>>>>>>> regular
>>>>>>>>>> PCollections, Dataflow (though currently only Dataflow) will allow 
>>>>>>>>>> update
>>>>>>>>>> even though the field orders have changed. Unfortunately
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 9, 2022 at 10:14 PM gaurav mishra <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I now have SchemaCoder for all data flowing through
>>>>>>>>>>> the pipeline.
>>>>>>>>>>> I am noticing another issue once a new pipeline is started with
>>>>>>>>>>> --update flag. In my stateful DoFn where I am doing 
>>>>>>>>>>> prevState.read(),
>>>>>>>>>>> exceptions are being thrown. errors like
>>>>>>>>>>> Caused by: java.io.EOFException: reached end of stream after
>>>>>>>>>>> reading 68 bytes; 101 bytes expected
>>>>>>>>>>> In this case there was no change in code or data model when
>>>>>>>>>>> relaunching the pipeline.
>>>>>>>>>>>
>>>>>>>>>>> It seems that Schema for my Bean is changing on every run of the
>>>>>>>>>>> pipeline. I manually inspected the Schema returned by the 
>>>>>>>>>>> SchemaRegistry by
>>>>>>>>>>> running the pipeline a few times locally and each time I noticed 
>>>>>>>>>>> the order
>>>>>>>>>>> of fields in the Schema was different. This changing of order of 
>>>>>>>>>>> fields is
>>>>>>>>>>> breaking the new pipeline. Is there a way to enforce some kind of 
>>>>>>>>>>> ordering
>>>>>>>>>>> on the fields?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jan 9, 2022 at 7:37 PM gaurav mishra <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looking at the source code for JavaBeanSchema, it seems I will
>>>>>>>>>>>> have to annotate my getters and setters with some flavour of
>>>>>>>>>>>> @Nullable.
>>>>>>>>>>>> Another question:
>>>>>>>>>>>> does the Dataflow update-pipeline feature work only with
>>>>>>>>>>>> SchemaCoder? I have SerializableCoder and ProtoCoder being used in
>>>>>>>>>>>> some of my other pipelines. I wonder if I have to change those too
>>>>>>>>>>>> to open the door for updating pipelines there?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 9, 2022 at 6:44 PM gaurav mishra <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Some progress made.
>>>>>>>>>>>>> Now I can see that the schema has all the needed fields, but all
>>>>>>>>>>>>> fields in the returned schema are marked as "NOT NULL", even
>>>>>>>>>>>>> though the fields are annotated with @javax.annotation.Nullable.
>>>>>>>>>>>>> Is there a different flavor of @Nullable I need to use on my
>>>>>>>>>>>>> fields?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 5:58 PM Reuven Lax <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ah, this isn't a POJO then. In this case, try using
>>>>>>>>>>>>>> DefaultSchema(JavaBeanSchema.class)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 5:55 PM gaurav mishra <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks something like this. It currently has around 40
>>>>>>>>>>>>>>> fields.
>>>>>>>>>>>>>>> @JsonIgnoreProperties(ignoreUnknown = true)
>>>>>>>>>>>>>>> @DefaultSchema(JavaFieldSchema.class)
>>>>>>>>>>>>>>> @EqualsAndHashCode
>>>>>>>>>>>>>>> @ToString
>>>>>>>>>>>>>>> public class POJO implements SomeInterface, Serializable {
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     public static final Integer SOME_CONSTANT_FIELD = 2;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_1")
>>>>>>>>>>>>>>>     private String field1;
>>>>>>>>>>>>>>>     @JsonProperty("field_2")
>>>>>>>>>>>>>>>     private String field2;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @Nullable
>>>>>>>>>>>>>>>     @JsonProperty("field_3")
>>>>>>>>>>>>>>>     private Integer field3;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     .....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_1")
>>>>>>>>>>>>>>>     public String getField1() {
>>>>>>>>>>>>>>>         return field1;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_1")
>>>>>>>>>>>>>>>     public void setField1(String field1) {
>>>>>>>>>>>>>>>         this.field1 = field1;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_2")
>>>>>>>>>>>>>>>     public String getField2() {
>>>>>>>>>>>>>>>         return field2;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_2")
>>>>>>>>>>>>>>>     public void setField2(String field2) {
>>>>>>>>>>>>>>>         this.field2 = field2;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_3")
>>>>>>>>>>>>>>>     public Integer getField3() {
>>>>>>>>>>>>>>>         return field3;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     @JsonProperty("field_3")
>>>>>>>>>>>>>>>     public void setField3(Integer field3) {
>>>>>>>>>>>>>>>         this.field3 = field3;
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     public POJO() {
>>>>>>>>>>>>>>>         // 0-arg empty constructor
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     public POJO(POJO other) {
>>>>>>>>>>>>>>>         // ... copy constructor
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 5:26 PM Reuven Lax <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you paste the code for your Pojo?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 4:09 PM gaurav mishra <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> schemaRegistry.getSchema(dataType) is returning an empty
>>>>>>>>>>>>>>>>> schema for me.
>>>>>>>>>>>>>>>>> My POJO is annotated with
>>>>>>>>>>>>>>>>> @DefaultSchema(JavaFieldSchema.class).
>>>>>>>>>>>>>>>>> Is there something extra I need to do here to register my
>>>>>>>>>>>>>>>>> class with the schema registry?
>>>>>>>>>>>>>>>>> Note: the code which is building the pipeline is sitting
>>>>>>>>>>>>>>>>> in a library (Different package) which is being imported into 
>>>>>>>>>>>>>>>>> my pipeline
>>>>>>>>>>>>>>>>> code. So perhaps there is some configuration which is missing 
>>>>>>>>>>>>>>>>> which allows
>>>>>>>>>>>>>>>>> the framework to discover my Pojo and the annotations 
>>>>>>>>>>>>>>>>> associated with it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 3:47 PM gaurav mishra <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 3:36 PM Reuven Lax <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 3:10 PM gaurav mishra <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think I can make it work now. I found a utility
>>>>>>>>>>>>>>>>>>>> method for building my coder from class
>>>>>>>>>>>>>>>>>>>> Something like
>>>>>>>>>>>>>>>>>>>> Class<Data> dataClass = userConfig.getDataClass();
>>>>>>>>>>>>>>>>>>>> Coder<Data> dataCoder = SchemaCoder.of(
>>>>>>>>>>>>>>>>>>>>     schemaRegistry.getSchema(dataClass),
>>>>>>>>>>>>>>>>>>>>     TypeDescriptor.of(dataClass),
>>>>>>>>>>>>>>>>>>>>     schemaRegistry.getToRowFunction(dataClass),
>>>>>>>>>>>>>>>>>>>>     schemaRegistry.getFromRowFunction(dataClass));
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This will work. Though, did annotating the POJO like I
>>>>>>>>>>>>>>>>>>> said not work?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No, the annotation alone does not work, since I am not
>>>>>>>>>>>>>>>>>> using concrete classes in the code where the pipeline is
>>>>>>>>>>>>>>>>>> being constructed; <Data> above is a type parameter of the
>>>>>>>>>>>>>>>>>> class that constructs the pipeline.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 2:14 PM gaurav mishra <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Removing the setCoder call breaks my pipeline:
>>>>>>>>>>>>>>>>>>>>> No Coder has been manually specified;  you may do so
>>>>>>>>>>>>>>>>>>>>> using .setCoder().
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>   Inferring a Coder from the CoderRegistry failed:
>>>>>>>>>>>>>>>>>>>>> Unable to provide a Coder for Data.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>   Building a Coder using a registered CoderProvider
>>>>>>>>>>>>>>>>>>>>> failed.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The reason is that the code building the pipeline is
>>>>>>>>>>>>>>>>>>>>> based on Java generics; the actual pipeline-building code
>>>>>>>>>>>>>>>>>>>>> sets a bunch of parameters that are used to construct the
>>>>>>>>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>>>>>>>> PCollection<Data> stream =
>>>>>>>>>>>>>>>>>>>>> pipeline.apply(userProvidedTransform).get(outputTag).setCoder(userProvidedCoder)
>>>>>>>>>>>>>>>>>>>>> So I guess I will need to provide some more
>>>>>>>>>>>>>>>>>>>>> information to the framework to make the annotation work.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 1:39 PM Reuven Lax <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> If you annotate your POJO
>>>>>>>>>>>>>>>>>>>>>> with @DefaultSchema(JavaFieldSchema.class), that will 
>>>>>>>>>>>>>>>>>>>>>> usually automatically
>>>>>>>>>>>>>>>>>>>>>> set up schema inference (you'll have to remove the 
>>>>>>>>>>>>>>>>>>>>>> setCoder call).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 1:32 PM gaurav mishra <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> How do I set up my pipeline to use Beam's schema
>>>>>>>>>>>>>>>>>>>>>>> encoding?
>>>>>>>>>>>>>>>>>>>>>>> In my current code I am doing something like this:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> PCollection<Data> =
>>>>>>>>>>>>>>>>>>>>>>> pipeline.apply(someTransform).get(outputTag).setCoder(AvroCoder.of(Data.class))
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 1:16 PM Reuven Lax <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I don't think we make any guarantees about Avro
>>>>>>>>>>>>>>>>>>>>>>>> coder. Can you use Beam's schema encoding instead?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Jan 9, 2022 at 1:14 PM gaurav mishra <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Is there a way to programmatically check for
>>>>>>>>>>>>>>>>>>>>>>>>> compatibility? I would like to fail my unit tests if 
>>>>>>>>>>>>>>>>>>>>>>>>> incompatible changes
>>>>>>>>>>>>>>>>>>>>>>>>> are made to Pojo.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jan 7, 2022 at 4:49 PM Luke Cwik <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Check the schema of the avro encoding for the
>>>>>>>>>>>>>>>>>>>>>>>>>> POJO before and after the change to ensure that they 
>>>>>>>>>>>>>>>>>>>>>>>>>> are compatible as you
>>>>>>>>>>>>>>>>>>>>>>>>>> expect.
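A sketch of how a unit test could check this without spinning up a pipeline. This is a simplified, hand-rolled compatibility rule (old fields unchanged, new fields nullable), not Avro's actual resolution algorithm; with real Avro schemas, org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility would be the ready-made check:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompatCheck {
    public static class Field {
        public final String name;
        public final String type;
        public final boolean nullable;

        public Field(String name, String type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }
    }

    // Simplified forward-compatibility rule: every old field must still be
    // present with the same type, and any newly added field must be nullable.
    public static boolean isCompatible(List<Field> oldSchema, List<Field> newSchema) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : newSchema) byName.put(f.name, f);
        for (Field old : oldSchema) {
            Field now = byName.remove(old.name);
            if (now == null || !now.type.equals(old.type)) return false;
        }
        for (Field added : byName.values()) {
            if (!added.nullable) return false;
        }
        return true;
    }
}
```

A unit test would then build the field list from the POJO (or its Avro schema) on each run and assert `isCompatible(committedSchema, currentSchema)`, failing the build on an incompatible model change.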
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Jan 7, 2022 at 4:12 PM gaurav mishra <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> This is more of a Dataflow question, I guess, but
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am asking here in hopes that someone has faced a
>>>>>>>>>>>>>>>>>>>>>>>>>>> similar problem and can help.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am trying to use "--update" option to update a
>>>>>>>>>>>>>>>>>>>>>>>>>>> running Dataflow job. I am noticing that 
>>>>>>>>>>>>>>>>>>>>>>>>>>> compatibility checks fail any time
>>>>>>>>>>>>>>>>>>>>>>>>>>> I add a new field to my data model. Error says
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The Coder or type for step XYZ  has changed
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am using a Java POJO for the data and AvroCoder
>>>>>>>>>>>>>>>>>>>>>>>>>>> to serialize the model. I read somewhere that
>>>>>>>>>>>>>>>>>>>>>>>>>>> adding new optional fields to the data should work
>>>>>>>>>>>>>>>>>>>>>>>>>>> when updating the pipeline.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am fine with changing the coder or the model
>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation to something that lets me update the
>>>>>>>>>>>>>>>>>>>>>>>>>>> pipeline when I add new optional fields to an
>>>>>>>>>>>>>>>>>>>>>>>>>>> existing model. Any suggestions?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
