Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-15 Thread Ismaël Mejía
Vincent, I will be out in the sense that I cannot really engage myself into more activities because I have apart of your review two more pending + other work to finish so I prefer not to add more work I cannot finish. I am still available for the review however so let’s get this finally finished

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-14 Thread Ismaël Mejía
It has been really interesting to read all the perspectives in this thread and I have even switched sides back and forth given the advantages / issues exposed here, so it means we have clear pros/cons. One ‘not so nice‘ discovery related to this discussion for me was BEAM-10375 [1] tldr; Reads

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-08 Thread Robert Bradshaw
OK, I'm +0 on this change. Using the PTransform as an element is probably better than duplicating the full API on another interface, and think it's worth getting this ublocked. This will require a Read2 if we have to add options in a upgrade-compatible way. On Tue, Jul 7, 2020 at 3:19 PM Luke

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-07 Thread Robert Bradshaw
On Tue, Jul 7, 2020 at 2:06 PM Luke Cwik wrote: > > Robert, the intent is that the Read object would use a schema coder and for > XLang purposes would be no different then a POJO. Just to clarify, you're saying that the Read PTransform would be encoded via the schema coder? That still feels a

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-07 Thread Luke Cwik
Robert, the intent is that the Read object would use a schema coder and for XLang purposes would be no different then a POJO. The issue of how to deal with closures applies to both equally and that is why I suggested to favor using data over closures. Once there is an implementation for how to

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-02 Thread Robert Bradshaw
On Thu, Jul 2, 2020 at 10:26 AM Kenneth Knowles wrote: > > On Wed, Jul 1, 2020 at 4:17 PM Eugene Kirpichov > wrote: > >> Kenn - I don't mean an enum of common closures, I mean expressing >> closures in a restricted sub-language such as the language of SQL >> expressions. >> > > My lack of

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-02 Thread Kenneth Knowles
On Wed, Jul 1, 2020 at 4:17 PM Eugene Kirpichov wrote: > Kenn - I don't mean an enum of common closures, I mean expressing closures > in a restricted sub-language such as the language of SQL expressions. > My lack of clarity: enums was my phrasing of Luke's item 1). I understood what you meant.

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-02 Thread Robert Bradshaw
I, too, have a general preference for using a POJO (and cross-language) compatible config object for ReadAll rather than having the Read elements themselves as PCollection elements. It seem the motivation is that the Read object already has all the builder methods (plus some other configuration

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-01 Thread Eugene Kirpichov
Kenn - I don't mean an enum of common closures, I mean expressing closures in a restricted sub-language such as the language of SQL expressions. That would only work if there is a portable way to interpret SQL expressions, but if there isn't, maybe there should be - for the sake of, well,

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-01 Thread Robert Burke
>From the Go side of the table, the Go language doesn't provide a mechanism to serialize or access closure data, which means DoFns can't be functional closures.This combined with the move to have the "Structural DoFns" be serialized using Beam Schemas, has the net result that if Go transforms are

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-30 Thread Kenneth Knowles
I don't think an enum of most common closures will work. The input types are typically generics that are made concrete by the caller who also provides the closures. I think Luke's (2) is the same idea as my "Java still assembles it [using opaque Python closures/transforms]". It seems like an

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-30 Thread Eugene Kirpichov
Yeah, mainly I just feel like dynamic reads and dynamic writes (and perhaps not-yet-invented similar transforms of other kinds) are tightly related - they are either very similar, or are duals of each other - so they should use the same approach. If they are using different approaches, it is a

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-30 Thread Luke Cwik
Kenn, I'm not too worried about closures since: 1) the expansion service for a transform could have a well set of defined closures by name that are returned as serialized objects that don't need to be interpretable by the caller 2) the language could store serialized functions of another language

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Brian Hulette
Sorry for jumping into this late and casting a vote against the consensus... but I think I'd prefer standardizing on a pattern like PCollection rather than PCollection. That approach clearly separates the parameters that are allowed to vary across a ReadAll (the ones defined in

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Kenneth Knowles
Eugene, let me see if I understand correctly. Your initial email suggested we consider adopting the "dynamic write" style for ReadAll transforms. Supposing Read had three fields x, y, and z, and you have a PCollection with all the info you need. (Ismaël's) "Read" way: Use a PTransform to

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Eugene Kirpichov
I'd like to raise one more time the question of consistency between dynamic reads and dynamic writes, per my email at the beginning of the thread. If the community prefers ReadAll to read from Read, then should dynamicWrite's write to Write? On Mon, Jun 29, 2020 at 8:57 AM Boyuan Zhang wrote: >

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Boyuan Zhang
It seems like most of us agree on the idea that ReadAll should read from Read. I'm going to update the Kafka ReadAll with the same pattern. Thanks for all your help! On Fri, Jun 26, 2020 at 12:12 PM Chamikara Jayalath wrote: > > > On Fri, Jun 26, 2020 at 11:49 AM Luke Cwik wrote: > >> I would

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-26 Thread Chamikara Jayalath
On Fri, Jun 26, 2020 at 11:49 AM Luke Cwik wrote: > I would also like to suggest that transforms that implement ReadAll via > Read should also provide methods like: > > // Uses the specified values if unspecified in the input element from the > PCollection. > withDefaults(Read read); > // Uses

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-26 Thread Luke Cwik
I would also like to suggest that transforms that implement ReadAll via Read should also provide methods like: // Uses the specified values if unspecified in the input element from the PCollection. withDefaults(Read read); // Uses the specified values regardless of what the input element from the

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-26 Thread Luke Cwik
Ismael, it is good to hear that using Read as the input didn't have a bunch of parameters that were being skipped/ignored. Also, for the polymorphism issue you have to rely on the user correctly telling you the type in such a way where it is a common ancestor of all the runtime types that will

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Chamikara Jayalath
Discussion regarding cross-language transforms is a slight tangent here. But I think, in general, it's great if we can use existing transforms (for example, IO connectors) as cross-language transforms without having to build more composites (irrespective of whether in ExternalTransformBuilders or

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Boyuan Zhang
For unbounded SDF in Kafka, we also consider the upgrading/downgrading compatibility in the pipeline update scenario(For detailed discussion, please refer to https://lists.apache.org/thread.html/raf073b8741317244339eb5b2bce844c0f9e0d700c3e4de392fc648d6%40%3Cdev.beam.apache.org%3E). In order to

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Ismaël Mejía
We forgot to mention (5) External.Config used in cross-lang, see KafkaIO ExternalTransformBuilder. This approach is the predecessor of (4) and probably a really good candidate to be replaced by the Row based Configuration Boyuan is envisioning (so good to be aware of this). Thanks for the clear

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Luke Cwik
I had mentioned that approach 1 and approach 2 work for cross language. The difference being that the cross language transform would take a well known definition and convert it to the Read transform. A normal user would have a pipeline that would look like: 1: PCollection -> PTransform(ReadAll) ->

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Alexey Romanenko
I believe that the initial goal of unifying ReadAll as a general "PTransform, PCollection>” was to reduce the amount of code duplication and error-prone approach related to this. It makes much sense since usually we have all needed configuration set in Read objects and, as Ismaeil mentioned,

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
Sorry for the typo. I mean I think we can go with *(3)* and (4): use the data type that is schema-aware as the input of ReadAll. On Wed, Jun 24, 2020 at 7:42 PM Boyuan Zhang wrote: > Thanks for the summary, Cham! > > I think we can go with (2) and (4): use the data type that is schema-aware >

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
Thanks for the summary, Cham! I think we can go with (2) and (4): use the data type that is schema-aware as the input of ReadAll. Converting Read into ReadAll helps us to stick with SDF-like IO. But only having (3) is not enough to solve the problem of using ReadAll in x-lang case. The key

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Chamikara Jayalath
I see. So it seems like there are three options discussed so far when it comes to defining source descriptors for ReadAll type transforms (1) Use Read PTransform as the element type of the input PCollection (2) Use a POJO that describes the source as the data element of the input PCollection (3)

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Luke Cwik
I believe we do require PTransforms to be serializable since anonymous DoFns typically capture the enclosing PTransform. On Wed, Jun 24, 2020 at 10:52 AM Chamikara Jayalath wrote: > Seems like Read in PCollection refers to a transform, at least here: >

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Chamikara Jayalath
Seems like Read in PCollection refers to a transform, at least here: https://github.com/apache/beam/blob/master/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseIO.java#L353 I'm in favour of separating construction time transforms from execution time data objects that we store

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
Hi Ismael, I think the ReadAll in the IO connector refers to the IO with SDF implementation despite the type of input, where Read refers to UnboundedSource. One major pushback of using KafkaIO.Read as source description is that not all configurations of KafkaIO.Read are meaningful to populate

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Luke Cwik
To provide additional context, the KafkaIO ReadAll transform takes a PCollection. This KafkaSourceDescriptor is a POJO that contains the configurable parameters for reading from Kafka. This is different from the pattern that Ismael listed because they take PCollection as input and the Read is the

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Eugene Kirpichov
Hi Ismael, Thanks for taking this on. Have you considered an approach similar (or dual) to FileIO.write(), where we in a sense also have to configure a dynamic number different IO transforms of the same type (file writes)? E.g. how in this example we configure many aspects of many file writes: