Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-06 Thread Rui Wang
I can help review the Java change.


-Rui

On Thu, Aug 6, 2020 at 9:53 AM Brian Hulette  wrote:

> The PR for this is up now: https://github.com/apache/beam/pull/12481
> Any volunteers to help review? We may want a separate reviewer for Python
> and Java changes.
>
> Brian
>
> On Wed, Aug 5, 2020 at 9:00 AM Brian Hulette  wrote:
>
>> What I'm working on changes ExternalConfigurationPayload [1] to this:
>>
>> message ExternalConfigurationPayload {
>>   // A schema for use in beam:coder:row:v1
>>   Schema schema = 1;
>>
>>   // A payload which can be decoded using beam:coder:row:v1 and the given
>> schema.
>>   bytes payload = 2;
>> }
>>
>> The calling SDK can infer a schema from a user configuration type (in
>> Python we can make minor changes to SchemaBasedPayloadBuilder for this),
>> and use it's implementation of beam:coder:row:v1 to encode an instance of
>> that type to the payload.
>>
>> Similarly the expanding SDK can infer a schema from a user configuration
>> type and map the encoded row to an instance of the user type, assuming the
>> schemas are compatible.
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/86b8326b4ebc4e217585847102743cc1d1af367a/model/pipeline/src/main/proto/external_transforms.proto#L42
>>
>> On Wed, Aug 5, 2020 at 2:04 AM Maximilian Michels  wrote:
>>
>>> +1
>>>
>>> The format to store coders is not set in stone, it was a first version
>>> to make external configuration work. Using the Coder message would be
>>> better.
>>>
>>> As for using Schema to store the configuration, could somebody fill me
>>> in how that would work?
>>>
>>> -Max
>>>
>>> On 04.08.20 02:01, Brian Hulette wrote:
>>> > I've opened BEAM-10571 [1] for this, and I'm most of the way to an
>>> > implementation now. Aiming to have it done before the 2.24.0 cut since
>>> > it will be the last release with python 2 support.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/BEAM-10571
>>> >
>>> > On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath <
>>> chamik...@google.com
>>> > > wrote:
>>> >
>>> >
>>> >
>>> > On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw <
>>> rober...@google.com
>>> > > wrote:
>>> >
>>> > On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
>>> > mailto:bhule...@google.com>> wrote:
>>> >  >
>>> >  > Ah yes I'm +1 for that approach too - it would let us
>>> > leverage all the schema-inference already in the Java SDK for
>>> > translating configuration objects which would be great.
>>> >  > Things on the Python side would be trickier as schemas don't
>>> > formally support all the types you can use in the
>>> PayloadBuilder
>>> > implementations [1] yet, just NamedTuple. For now we could just
>>> > make the PayloadBuilder implementations generate Rows without
>>> > making that translation available for use in PCollections.
>>> >
>>> >
>>> > This will be a good opportunity to add some sort of a minimal
>>> Python
>>> > type to Beam schema mapping :)
>>> >
>>> >
>>> > Yes, though eventually it might be nice to support all of these
>>> > various types as schema'd PCollection elements as well.
>>> >
>>> >  > Do we need to worry about update compatibility for
>>> > ExternalConfigurationPayload?
>>> >
>>> > Technically, each URN defines their payload, and the fact that
>>> we've
>>> > settled on ExternalConfigurationPayload is a convention. On a
>>> > practical note, we haven't declared these protos stable yet. (I
>>> > would
>>> > like to do so before we drop support for Python 2, as external
>>> > transforms are a possible escape hatch and the first strong
>>> > motivation
>>> > to have external transforms that span Beam versions).
>>> >
>>> >
>>> > +1
>>> >
>>> >
>>> >  > [1]
>>> >
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>>> >  >
>>> >  > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
>>> > mailto:rober...@google.com>> wrote:
>>> >  >>
>>> >  >> I would be in favor of just using a schema to store the
>>> entire
>>> >  >> configuration. The reason we went with what we have to day
>>> > is that we
>>> >  >> didn't have cross language schemas yet.
>>> >  >>
>>> >  >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
>>> > mailto:bhule...@google.com>> wrote:
>>> >  >> >
>>> >  >> > Hi everyone,
>>> >  >> > I noticed that currently the ExternalConfigurationPayload
>>> > uses a list of coder URNs to represent the coder that was used
>>> > to serialize each configuration field [1]. This seems
>>> acceptable
>>> > at first blush, but there's one notable issue: it has no place
>>> > to store a payload for the coder. Most standard coders don't

Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-06 Thread Brian Hulette
The PR for this is up now: https://github.com/apache/beam/pull/12481
Any volunteers to help review? We may want a separate reviewer for Python
and Java changes.

Brian

On Wed, Aug 5, 2020 at 9:00 AM Brian Hulette  wrote:

> What I'm working on changes ExternalConfigurationPayload [1] to this:
>
> message ExternalConfigurationPayload {
>   // A schema for use in beam:coder:row:v1
>   Schema schema = 1;
>
>   // A payload which can be decoded using beam:coder:row:v1 and the given
> schema.
>   bytes payload = 2;
> }
>
> The calling SDK can infer a schema from a user configuration type (in
> Python we can make minor changes to SchemaBasedPayloadBuilder for this),
> and use it's implementation of beam:coder:row:v1 to encode an instance of
> that type to the payload.
>
> Similarly the expanding SDK can infer a schema from a user configuration
> type and map the encoded row to an instance of the user type, assuming the
> schemas are compatible.
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/86b8326b4ebc4e217585847102743cc1d1af367a/model/pipeline/src/main/proto/external_transforms.proto#L42
>
> On Wed, Aug 5, 2020 at 2:04 AM Maximilian Michels  wrote:
>
>> +1
>>
>> The format to store coders is not set in stone, it was a first version
>> to make external configuration work. Using the Coder message would be
>> better.
>>
>> As for using Schema to store the configuration, could somebody fill me
>> in how that would work?
>>
>> -Max
>>
>> On 04.08.20 02:01, Brian Hulette wrote:
>> > I've opened BEAM-10571 [1] for this, and I'm most of the way to an
>> > implementation now. Aiming to have it done before the 2.24.0 cut since
>> > it will be the last release with python 2 support.
>> >
>> > [1] https://issues.apache.org/jira/browse/BEAM-10571
>> >
>> > On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath <
>> chamik...@google.com
>> > > wrote:
>> >
>> >
>> >
>> > On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw <
>> rober...@google.com
>> > > wrote:
>> >
>> > On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
>> > mailto:bhule...@google.com>> wrote:
>> >  >
>> >  > Ah yes I'm +1 for that approach too - it would let us
>> > leverage all the schema-inference already in the Java SDK for
>> > translating configuration objects which would be great.
>> >  > Things on the Python side would be trickier as schemas don't
>> > formally support all the types you can use in the PayloadBuilder
>> > implementations [1] yet, just NamedTuple. For now we could just
>> > make the PayloadBuilder implementations generate Rows without
>> > making that translation available for use in PCollections.
>> >
>> >
>> > This will be a good opportunity to add some sort of a minimal Python
>> > type to Beam schema mapping :)
>> >
>> >
>> > Yes, though eventually it might be nice to support all of these
>> > various types as schema'd PCollection elements as well.
>> >
>> >  > Do we need to worry about update compatibility for
>> > ExternalConfigurationPayload?
>> >
>> > Technically, each URN defines their payload, and the fact that
>> we've
>> > settled on ExternalConfigurationPayload is a convention. On a
>> > practical note, we haven't declared these protos stable yet. (I
>> > would
>> > like to do so before we drop support for Python 2, as external
>> > transforms are a possible escape hatch and the first strong
>> > motivation
>> > to have external transforms that span Beam versions).
>> >
>> >
>> > +1
>> >
>> >
>> >  > [1]
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>> >  >
>> >  > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
>> > mailto:rober...@google.com>> wrote:
>> >  >>
>> >  >> I would be in favor of just using a schema to store the
>> entire
>> >  >> configuration. The reason we went with what we have to day
>> > is that we
>> >  >> didn't have cross language schemas yet.
>> >  >>
>> >  >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
>> > mailto:bhule...@google.com>> wrote:
>> >  >> >
>> >  >> > Hi everyone,
>> >  >> > I noticed that currently the ExternalConfigurationPayload
>> > uses a list of coder URNs to represent the coder that was used
>> > to serialize each configuration field [1]. This seems acceptable
>> > at first blush, but there's one notable issue: it has no place
>> > to store a payload for the coder. Most standard coders don't use
>> > a payload so it's not a problem, but row coder does use a
>> > payload to store it's schema, which means it can't be used in an
>> > ExternalConfigurationPayload today.
>> >  >> >
>> >  

Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-05 Thread Brian Hulette
What I'm working on changes ExternalConfigurationPayload [1] to this:

message ExternalConfigurationPayload {
  // A schema for use in beam:coder:row:v1
  Schema schema = 1;

  // A payload which can be decoded using beam:coder:row:v1 and the given
schema.
  bytes payload = 2;
}

The calling SDK can infer a schema from a user configuration type (in
Python we can make minor changes to SchemaBasedPayloadBuilder for this),
and use it's implementation of beam:coder:row:v1 to encode an instance of
that type to the payload.

Similarly the expanding SDK can infer a schema from a user configuration
type and map the encoded row to an instance of the user type, assuming the
schemas are compatible.

Brian

[1]
https://github.com/apache/beam/blob/86b8326b4ebc4e217585847102743cc1d1af367a/model/pipeline/src/main/proto/external_transforms.proto#L42

On Wed, Aug 5, 2020 at 2:04 AM Maximilian Michels  wrote:

> +1
>
> The format to store coders is not set in stone, it was a first version
> to make external configuration work. Using the Coder message would be
> better.
>
> As for using Schema to store the configuration, could somebody fill me
> in how that would work?
>
> -Max
>
> On 04.08.20 02:01, Brian Hulette wrote:
> > I've opened BEAM-10571 [1] for this, and I'm most of the way to an
> > implementation now. Aiming to have it done before the 2.24.0 cut since
> > it will be the last release with python 2 support.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-10571
> >
> > On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath  > > wrote:
> >
> >
> >
> > On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw  > > wrote:
> >
> > On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
> > mailto:bhule...@google.com>> wrote:
> >  >
> >  > Ah yes I'm +1 for that approach too - it would let us
> > leverage all the schema-inference already in the Java SDK for
> > translating configuration objects which would be great.
> >  > Things on the Python side would be trickier as schemas don't
> > formally support all the types you can use in the PayloadBuilder
> > implementations [1] yet, just NamedTuple. For now we could just
> > make the PayloadBuilder implementations generate Rows without
> > making that translation available for use in PCollections.
> >
> >
> > This will be a good opportunity to add some sort of a minimal Python
> > type to Beam schema mapping :)
> >
> >
> > Yes, though eventually it might be nice to support all of these
> > various types as schema'd PCollection elements as well.
> >
> >  > Do we need to worry about update compatibility for
> > ExternalConfigurationPayload?
> >
> > Technically, each URN defines their payload, and the fact that
> we've
> > settled on ExternalConfigurationPayload is a convention. On a
> > practical note, we haven't declared these protos stable yet. (I
> > would
> > like to do so before we drop support for Python 2, as external
> > transforms are a possible escape hatch and the first strong
> > motivation
> > to have external transforms that span Beam versions).
> >
> >
> > +1
> >
> >
> >  > [1]
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
> >  >
> >  > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
> > mailto:rober...@google.com>> wrote:
> >  >>
> >  >> I would be in favor of just using a schema to store the
> entire
> >  >> configuration. The reason we went with what we have to day
> > is that we
> >  >> didn't have cross language schemas yet.
> >  >>
> >  >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
> > mailto:bhule...@google.com>> wrote:
> >  >> >
> >  >> > Hi everyone,
> >  >> > I noticed that currently the ExternalConfigurationPayload
> > uses a list of coder URNs to represent the coder that was used
> > to serialize each configuration field [1]. This seems acceptable
> > at first blush, but there's one notable issue: it has no place
> > to store a payload for the coder. Most standard coders don't use
> > a payload so it's not a problem, but row coder does use a
> > payload to store it's schema, which means it can't be used in an
> > ExternalConfigurationPayload today.
> >  >> >
> >  >> > Is there a reason not to just use the Coder message [2] in
> > ExternalConfigurationPayload instead of a list of coder URNs?
> > That would work with row coder, and it would also make it easier
> > to re-use logic for translating Pipeline protos.
> >  >> >
> >  >> > I'd be happy to make this change, but I wanted to ask on
> > dev@ in case there's something 

Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-05 Thread Maximilian Michels

+1

The format to store coders is not set in stone, it was a first version 
to make external configuration work. Using the Coder message would be 
better.


As for using Schema to store the configuration, could somebody fill me 
in how that would work?


-Max

On 04.08.20 02:01, Brian Hulette wrote:
I've opened BEAM-10571 [1] for this, and I'm most of the way to an 
implementation now. Aiming to have it done before the 2.24.0 cut since 
it will be the last release with python 2 support.


[1] https://issues.apache.org/jira/browse/BEAM-10571

On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath > wrote:




On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw mailto:rober...@google.com>> wrote:

On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
mailto:bhule...@google.com>> wrote:
 >
 > Ah yes I'm +1 for that approach too - it would let us
leverage all the schema-inference already in the Java SDK for
translating configuration objects which would be great.
 > Things on the Python side would be trickier as schemas don't
formally support all the types you can use in the PayloadBuilder
implementations [1] yet, just NamedTuple. For now we could just
make the PayloadBuilder implementations generate Rows without
making that translation available for use in PCollections.


This will be a good opportunity to add some sort of a minimal Python
type to Beam schema mapping :)


Yes, though eventually it might be nice to support all of these
various types as schema'd PCollection elements as well.

 > Do we need to worry about update compatibility for
ExternalConfigurationPayload?

Technically, each URN defines their payload, and the fact that we've
settled on ExternalConfigurationPayload is a convention. On a
practical note, we haven't declared these protos stable yet. (I
would
like to do so before we drop support for Python 2, as external
transforms are a possible escape hatch and the first strong
motivation
to have external transforms that span Beam versions).


+1


 > [1]

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
 >
 > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
mailto:rober...@google.com>> wrote:
 >>
 >> I would be in favor of just using a schema to store the entire
 >> configuration. The reason we went with what we have to day
is that we
 >> didn't have cross language schemas yet.
 >>
 >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
mailto:bhule...@google.com>> wrote:
 >> >
 >> > Hi everyone,
 >> > I noticed that currently the ExternalConfigurationPayload
uses a list of coder URNs to represent the coder that was used
to serialize each configuration field [1]. This seems acceptable
at first blush, but there's one notable issue: it has no place
to store a payload for the coder. Most standard coders don't use
a payload so it's not a problem, but row coder does use a
payload to store it's schema, which means it can't be used in an
ExternalConfigurationPayload today.
 >> >
 >> > Is there a reason not to just use the Coder message [2] in
ExternalConfigurationPayload instead of a list of coder URNs?
That would work with row coder, and it would also make it easier
to re-use logic for translating Pipeline protos.
 >> >
 >> > I'd be happy to make this change, but I wanted to ask on
dev@ in case there's something I'm missing here.
 >> >
 >> > Brian
 >> >
 >> > [1]

https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
 >> > [2]

https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555



Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-03 Thread Brian Hulette
I've opened BEAM-10571 [1] for this, and I'm most of the way to an
implementation now. Aiming to have it done before the 2.24.0 cut since it
will be the last release with python 2 support.

[1] https://issues.apache.org/jira/browse/BEAM-10571

On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath 
wrote:

>
>
> On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette 
>> wrote:
>> >
>> > Ah yes I'm +1 for that approach too - it would let us leverage all the
>> schema-inference already in the Java SDK for translating configuration
>> objects which would be great.
>> > Things on the Python side would be trickier as schemas don't formally
>> support all the types you can use in the PayloadBuilder implementations [1]
>> yet, just NamedTuple. For now we could just make the PayloadBuilder
>> implementations generate Rows without making that translation available for
>> use in PCollections.
>>
>
> This will be a good opportunity to add some sort of a minimal Python type
> to Beam schema mapping :)
>
>
>>
>> Yes, though eventually it might be nice to support all of these
>> various types as schema'd PCollection elements as well.
>>
>> > Do we need to worry about update compatibility for
>> ExternalConfigurationPayload?
>>
>> Technically, each URN defines their payload, and the fact that we've
>> settled on ExternalConfigurationPayload is a convention. On a
>> practical note, we haven't declared these protos stable yet. (I would
>> like to do so before we drop support for Python 2, as external
>> transforms are a possible escape hatch and the first strong motivation
>> to have external transforms that span Beam versions).
>>
>
> +1
>
>
>>
>> > [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>> >
>> > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> I would be in favor of just using a schema to store the entire
>> >> configuration. The reason we went with what we have to day is that we
>> >> didn't have cross language schemas yet.
>> >>
>> >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette 
>> wrote:
>> >> >
>> >> > Hi everyone,
>> >> > I noticed that currently the ExternalConfigurationPayload uses a
>> list of coder URNs to represent the coder that was used to serialize each
>> configuration field [1]. This seems acceptable at first blush, but there's
>> one notable issue: it has no place to store a payload for the coder. Most
>> standard coders don't use a payload so it's not a problem, but row coder
>> does use a payload to store it's schema, which means it can't be used in an
>> ExternalConfigurationPayload today.
>> >> >
>> >> > Is there a reason not to just use the Coder message [2] in
>> ExternalConfigurationPayload instead of a list of coder URNs? That would
>> work with row coder, and it would also make it easier to re-use logic for
>> translating Pipeline protos.
>> >> >
>> >> > I'd be happy to make this change, but I wanted to ask on dev@ in
>> case there's something I'm missing here.
>> >> >
>> >> > Brian
>> >> >
>> >> > [1]
>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
>> >> > [2]
>> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555
>>
>


Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-07-15 Thread Chamikara Jayalath
On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw  wrote:

> On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette  wrote:
> >
> > Ah yes I'm +1 for that approach too - it would let us leverage all the
> schema-inference already in the Java SDK for translating configuration
> objects which would be great.
> > Things on the Python side would be trickier as schemas don't formally
> support all the types you can use in the PayloadBuilder implementations [1]
> yet, just NamedTuple. For now we could just make the PayloadBuilder
> implementations generate Rows without making that translation available for
> use in PCollections.
>

This will be a good opportunity to add some sort of a minimal Python type
to Beam schema mapping :)


>
> Yes, though eventually it might be nice to support all of these
> various types as schema'd PCollection elements as well.
>
> > Do we need to worry about update compatibility for
> ExternalConfigurationPayload?
>
> Technically, each URN defines their payload, and the fact that we've
> settled on ExternalConfigurationPayload is a convention. On a
> practical note, we haven't declared these protos stable yet. (I would
> like to do so before we drop support for Python 2, as external
> transforms are a possible escape hatch and the first strong motivation
> to have external transforms that span Beam versions).
>

+1


>
> > [1]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
> >
> > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw 
> wrote:
> >>
> >> I would be in favor of just using a schema to store the entire
> >> configuration. The reason we went with what we have to day is that we
> >> didn't have cross language schemas yet.
> >>
> >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette 
> wrote:
> >> >
> >> > Hi everyone,
> >> > I noticed that currently the ExternalConfigurationPayload uses a list
> of coder URNs to represent the coder that was used to serialize each
> configuration field [1]. This seems acceptable at first blush, but there's
> one notable issue: it has no place to store a payload for the coder. Most
> standard coders don't use a payload so it's not a problem, but row coder
> does use a payload to store it's schema, which means it can't be used in an
> ExternalConfigurationPayload today.
> >> >
> >> > Is there a reason not to just use the Coder message [2] in
> ExternalConfigurationPayload instead of a list of coder URNs? That would
> work with row coder, and it would also make it easier to re-use logic for
> translating Pipeline protos.
> >> >
> >> > I'd be happy to make this change, but I wanted to ask on dev@ in
> case there's something I'm missing here.
> >> >
> >> > Brian
> >> >
> >> > [1]
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
> >> > [2]
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555
>


Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-07-10 Thread Robert Bradshaw
On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette  wrote:
>
> Ah yes I'm +1 for that approach too - it would let us leverage all the 
> schema-inference already in the Java SDK for translating configuration 
> objects which would be great.
> Things on the Python side would be trickier as schemas don't formally support 
> all the types you can use in the PayloadBuilder implementations [1] yet, just 
> NamedTuple. For now we could just make the PayloadBuilder implementations 
> generate Rows without making that translation available for use in 
> PCollections.

Yes, though eventually it might be nice to support all of these
various types as schema'd PCollection elements as well.

> Do we need to worry about update compatibility for 
> ExternalConfigurationPayload?

Technically, each URN defines their payload, and the fact that we've
settled on ExternalConfigurationPayload is a convention. On a
practical note, we haven't declared these protos stable yet. (I would
like to do so before we drop support for Python 2, as external
transforms are a possible escape hatch and the first strong motivation
to have external transforms that span Beam versions).

> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>
> On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw  wrote:
>>
>> I would be in favor of just using a schema to store the entire
>> configuration. The reason we went with what we have to day is that we
>> didn't have cross language schemas yet.
>>
>> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette  wrote:
>> >
>> > Hi everyone,
>> > I noticed that currently the ExternalConfigurationPayload uses a list of 
>> > coder URNs to represent the coder that was used to serialize each 
>> > configuration field [1]. This seems acceptable at first blush, but there's 
>> > one notable issue: it has no place to store a payload for the coder. Most 
>> > standard coders don't use a payload so it's not a problem, but row coder 
>> > does use a payload to store it's schema, which means it can't be used in 
>> > an ExternalConfigurationPayload today.
>> >
>> > Is there a reason not to just use the Coder message [2] in 
>> > ExternalConfigurationPayload instead of a list of coder URNs? That would 
>> > work with row coder, and it would also make it easier to re-use logic for 
>> > translating Pipeline protos.
>> >
>> > I'd be happy to make this change, but I wanted to ask on dev@ in case 
>> > there's something I'm missing here.
>> >
>> > Brian
>> >
>> > [1] 
>> > https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
>> > [2] 
>> > https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555


Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-07-10 Thread Brian Hulette
Ah yes I'm +1 for that approach too - it would let us leverage all the
schema-inference already in the Java SDK for translating configuration
objects which would be great.
Things on the Python side would be trickier as schemas don't formally
support all the types you can use in the PayloadBuilder implementations [1]
yet, just NamedTuple. For now we could just make the PayloadBuilder
implementations generate Rows without making that translation available for
use in PCollections.

Do we need to worry about update compatibility for
ExternalConfigurationPayload?

Brian

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py

On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw  wrote:

> I would be in favor of just using a schema to store the entire
> configuration. The reason we went with what we have to day is that we
> didn't have cross language schemas yet.
>
> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette 
> wrote:
> >
> > Hi everyone,
> > I noticed that currently the ExternalConfigurationPayload uses a list of
> coder URNs to represent the coder that was used to serialize each
> configuration field [1]. This seems acceptable at first blush, but there's
> one notable issue: it has no place to store a payload for the coder. Most
> standard coders don't use a payload so it's not a problem, but row coder
> does use a payload to store it's schema, which means it can't be used in an
> ExternalConfigurationPayload today.
> >
> > Is there a reason not to just use the Coder message [2] in
> ExternalConfigurationPayload instead of a list of coder URNs? That would
> work with row coder, and it would also make it easier to re-use logic for
> translating Pipeline protos.
> >
> > I'd be happy to make this change, but I wanted to ask on dev@ in case
> there's something I'm missing here.
> >
> > Brian
> >
> > [1]
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
> > [2]
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555
>


Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-07-10 Thread Robert Bradshaw
I would be in favor of just using a schema to store the entire
configuration. The reason we went with what we have to day is that we
didn't have cross language schemas yet.

On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette  wrote:
>
> Hi everyone,
> I noticed that currently the ExternalConfigurationPayload uses a list of 
> coder URNs to represent the coder that was used to serialize each 
> configuration field [1]. This seems acceptable at first blush, but there's 
> one notable issue: it has no place to store a payload for the coder. Most 
> standard coders don't use a payload so it's not a problem, but row coder does 
> use a payload to store it's schema, which means it can't be used in an 
> ExternalConfigurationPayload today.
>
> Is there a reason not to just use the Coder message [2] in 
> ExternalConfigurationPayload instead of a list of coder URNs? That would work 
> with row coder, and it would also make it easier to re-use logic for 
> translating Pipeline protos.
>
> I'd be happy to make this change, but I wanted to ask on dev@ in case there's 
> something I'm missing here.
>
> Brian
>
> [1] 
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
> [2] 
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555


Use Coder message for cross-lang ExternalConfigurationPayload?

2020-07-10 Thread Brian Hulette
Hi everyone,
I noticed that currently the ExternalConfigurationPayload uses a list of
coder URNs to represent the coder that was used to serialize each
configuration field [1]. This seems acceptable at first blush, but there's
one notable issue: it has no place to store a payload for the coder. Most
standard coders don't use a payload so it's not a problem, but row coder
does use a payload to store it's schema, which means it can't be used in an
ExternalConfigurationPayload today.

Is there a reason not to just use the Coder message [2] in
ExternalConfigurationPayload instead of a list of coder URNs? That would
work with row coder, and it would also make it easier to re-use logic for
translating Pipeline protos.

I'd be happy to make this change, but I wanted to ask on dev@ in case
there's something I'm missing here.

Brian

[1]
https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
[2]
https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555