Re: Requiring PTransform to set a coder on its resulting collections

2017-08-11 Thread Robert Bradshaw
On Thu, Aug 10, 2017 at 5:06 PM, Reuven Lax wrote: > Interestingly I've seen examples of PTransforms where the transform itself > is unable to easily set its own coder. This happens when the transform is > parametrized in such a way that its ouput coder is not determinable except > by the caller o

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Eugene Kirpichov
I think this is the essence of the guidance: in such cases, the caller should indeed pass a coder to the PTransform. This might seem trivial if the only thing the PTransform will do is set it on the output collection, but it allows the transform to evolve in case it ever needs to create an interme

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Reuven Lax
Interestingly I've seen examples of PTransforms where the transform itself is unable to easily set its own coder. This happens when the transform is parametrized in such a way that its ouput coder is not determinable except by the caller of the PTransform. The caller can of course pass a coder into

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Eugene Kirpichov
I've updated the guidance in PTransform Style Guide on setting coders https://beam.apache.org/contribute/ptransform-style-guide/#coders according to this discussion. https://github.com/apache/beam-site/pull/279 On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw wrote: > On Thu, Aug 3, 2017 at 6:08 P

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-03 Thread Robert Bradshaw
On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov wrote: > https://github.com/apache/beam/pull/3649 has landed. The main contribution > of this PR is deprecating PTransform.getDefaultOutputCoder(). > > Next steps are to get rid of all setCoder() calls in the SDK, and deprecate > setCoder(). > Nearl

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-03 Thread Eugene Kirpichov
https://github.com/apache/beam/pull/3649 has landed. The main contribution of this PR is deprecating PTransform.getDefaultOutputCoder(). Next steps are to get rid of all setCoder() calls in the SDK, and deprecate setCoder(). Nearly all setCoder() calls (perhaps simply all?) I found are on the outp

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-03 Thread Lukasz Cwik
I'm for (1) and am not sure about the feasibility of (2) without having an escape hatch that allows a pipeline author to specify a coder to handle their special case. On Tue, Aug 1, 2017 at 2:15 PM, Reuven Lax wrote: > One interesting wrinkle: I'm about to propose a set of semantics for > snapsh

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-01 Thread Reuven Lax
One interesting wrinkle: I'm about to propose a set of semantics for snapshotting/in-place updating pipelines. Part of this proposal is the ability to write pipelines to "upgrade" snapshots to make them compatible with new graphs. This relies on the ability to have two separate coders for the same

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-01 Thread Robert Bradshaw
There are two concerns in this thread: (1) Getting rid of PCollection.setCoder(). Everyone seems in favor of this (right?) (2) Deprecating specifying Coders in favor of specifying TypeDescriptors. I'm generally in favor, but it's unclear how far we can push this through. Let's at least do (1), a

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Kenneth Knowles
Another thought on this: setting a custom coder to support a special data distribution is likely often a property of the input to the pipeline. So setting a coder during pipeline construction - more generally, when writing a composite transform for reuse - you might not actually have the needed inf

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Kenneth Knowles
On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh wrote: > introduce a > new, specialized type to represent the restricted > (alternatively-distributed?) data. The TypeDescriptor for this type can map > to the specialized coder, without having to perform a significant degree of > potentially wasted e

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Jean-Baptiste Onofré
And what about having a kind of map in the CoderRegistry with type -> coder ? byte[] could be an entry, a specialized type another. Thought ? Regards JB On 07/27/2017 08:18 PM, Thomas Groh wrote: I think there's a simpler solution than encoding to byte[]: introduce a new, specialized type to

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Jean-Baptiste Onofré
Fully agree Thomas, I think it's way better. Regards JB On 07/27/2017 08:00 PM, Thomas Groh wrote: +1 on getting rid of setCoder; just from a Java SDK perspective, my ideal world contains PCollections which don't have a user-visible way to mutate them. My preference would be to use TypeDescrip

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Thomas Groh
I think there's a simpler solution than encoding to byte[]: introduce a new, specialized type to represent the restricted (alternatively-distributed?) data. The TypeDescriptor for this type can map to the specialized coder, without having to perform a significant degree of potentially wasted encodi

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Reuven Lax
I tend to agree with Robert - it would be unfortunate if a single TypeDescrictor was forced to have the same encoding all through the pipeline. However it's also unfortunate if this flexibility impacted every part of the programming model. I also think that our experience has been that "large scale

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Thomas Groh
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal world contains PCollections which don't have a user-visible way to mutate them. My preference would be to use TypeDescriptors everywhere within Pipeline construction (where possible), and utilize the CoderRegistry everywhere

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Robert Bradshaw
On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles wrote: > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik > wrote: >> >> Ken/Robert, I believe users will want the ability to set the output coder >> because coders may have intrinsic properties where the type isn't enough >> information to fully spec

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Kenneth Knowles
On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik wrote: > > Ken/Robert, I believe users will want the ability to set the output coder > because coders may have intrinsic properties where the type isn't enough > information to fully specify what I want as a user. Some cases I can see > are: > 1) I have

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Lukasz Cwik
Eugene, a two pass model allows for the framework to first gather all the information the user specified, and then in the second pass you actually perform the expansion of PTransforms. Since the user isn't adding any additional data to the pipeline, every transform could inspect all the attributes

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Jean-Baptiste Onofré
Hi, That's an interesting thread and I was wondering the relationship between type descriptor and coder for a while ;) Today, in a PCollection, we can set the coder and we also have a getTypeDescriptor(). It sounds weird to me: it should be one or the other. Basically, if the Coder is not u

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-27 Thread Robert Bradshaw
I like the ideal of "no coders during construction." I also like the idea of a second pass to determine coders (this would work particularly well in Python). However, this assumes a 1:1 relationship between TypeDescriptors and Coders, which is not accurate in practice. In particular: * Some types

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Kenneth Knowles
On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method, > and I wonder how much will break if all call sites of setCoder() will call > setTypeDescriptor() instead - i.e. how far are we fr

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method, and I wonder how much will break if all call sites of setCoder() will call setTypeDescriptor() instead - i.e. how far are we from the ideal state of having a coder inferrable for every sufficiently concrete type descriptor.

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Kenneth Knowles
+1 but maybe go ever further On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hello, > > I've worked on a few different things recently and ran repeatedly into the > same issue: that we do not have clear guidance on who should set the Coder > on a PCollec

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Okay, first PR is in review https://github.com/apache/beam/pull/3649 On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw wrote: > +1, I'm a huge fan of moving this direction. Right now there's also > the ugliness that setCoder() may be called any number of times before > a PCollection is used (the

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Robert Bradshaw
+1, I'm a huge fan of moving this direction. Right now there's also the ugliness that setCoder() may be called any number of times before a PCollection is used (the last setter winning) but is an error to call it once it has been used (and here "used" is not clear--if a PCollection is returned from

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Mingmin Xu
Second that 'it's responsibility of the transform'. For the case when a PTransform doesn't have enough information(PTransform developer should have the knowledge), I would prefer a strict way so users won't forget to call withSomethingCoder(), like - a Coder is required to new the PTransform; - or

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
Hm, can you elaborate? I'm not sure how this relates to my suggestion, the gist of which is "PTransform's should set the coder on all of their outputs, and the user should never have to .setCoder() on a PCollection obtained from a PTransform" On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik wrote: >

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Lukasz Cwik
I'm split between our current one pass model of pipeline construction and a two pass model where all information is gathered and then PTransform expansions are performed. On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hello, > > I've worked on a few di