On Thu, Aug 10, 2017 at 5:06 PM, Reuven Lax wrote:
> Interestingly I've seen examples of PTransforms where the transform itself
> is unable to easily set its own coder. This happens when the transform is
> parametrized in such a way that its ouput coder is not determinable except
> by the caller o
I think this is the essence of the guidance: in such cases, the caller
should indeed pass a coder to the PTransform.
This might seem trivial if the only thing the PTransform will do is set it
on the output collection, but it allows the transform to evolve in case it
ever needs to create an interme
Interestingly I've seen examples of PTransforms where the transform itself
is unable to easily set its own coder. This happens when the transform is
parametrized in such a way that its ouput coder is not determinable except
by the caller of the PTransform. The caller can of course pass a coder into
I've updated the guidance in PTransform Style Guide on setting coders
https://beam.apache.org/contribute/ptransform-style-guide/#coders
according to this discussion.
https://github.com/apache/beam-site/pull/279
On Thu, Aug 3, 2017 at 6:27 PM Robert Bradshaw
wrote:
> On Thu, Aug 3, 2017 at 6:08 P
On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov
wrote:
> https://github.com/apache/beam/pull/3649 has landed. The main contribution
> of this PR is deprecating PTransform.getDefaultOutputCoder().
>
> Next steps are to get rid of all setCoder() calls in the SDK, and deprecate
> setCoder().
> Nearl
https://github.com/apache/beam/pull/3649 has landed. The main contribution
of this PR is deprecating PTransform.getDefaultOutputCoder().
Next steps are to get rid of all setCoder() calls in the SDK, and deprecate
setCoder().
Nearly all setCoder() calls (perhaps simply all?) I found are on the outp
I'm for (1) and am not sure about the feasibility of (2) without having an
escape hatch that allows a pipeline author to specify a coder to handle
their special case.
On Tue, Aug 1, 2017 at 2:15 PM, Reuven Lax wrote:
> One interesting wrinkle: I'm about to propose a set of semantics for
> snapsh
One interesting wrinkle: I'm about to propose a set of semantics for
snapshotting/in-place updating pipelines. Part of this proposal is the
ability to write pipelines to "upgrade" snapshots to make them compatible
with new graphs. This relies on the ability to have two separate coders for
the same
There are two concerns in this thread:
(1) Getting rid of PCollection.setCoder(). Everyone seems in favor of this
(right?)
(2) Deprecating specifying Coders in favor of specifying TypeDescriptors.
I'm generally in favor, but it's unclear how far we can push this through.
Let's at least do (1), a
Another thought on this: setting a custom coder to support a special data
distribution is likely often a property of the input to the pipeline. So
setting a coder during pipeline construction - more generally, when writing
a composite transform for reuse - you might not actually have the needed
inf
On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh
wrote:
> introduce a
> new, specialized type to represent the restricted
> (alternatively-distributed?) data. The TypeDescriptor for this type can map
> to the specialized coder, without having to perform a significant degree of
> potentially wasted e
And what about having a kind of map in the CoderRegistry with type -> coder ?
byte[] could be an entry, a specialized type another.
Thought ?
Regards
JB
On 07/27/2017 08:18 PM, Thomas Groh wrote:
I think there's a simpler solution than encoding to byte[]: introduce a
new, specialized type to
Fully agree Thomas, I think it's way better.
Regards
JB
On 07/27/2017 08:00 PM, Thomas Groh wrote:
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.
My preference would be to use TypeDescrip
I think there's a simpler solution than encoding to byte[]: introduce a
new, specialized type to represent the restricted
(alternatively-distributed?) data. The TypeDescriptor for this type can map
to the specialized coder, without having to perform a significant degree of
potentially wasted encodi
I tend to agree with Robert - it would be unfortunate if a single
TypeDescrictor was forced to have the same encoding all through the
pipeline. However it's also unfortunate if this flexibility impacted every
part of the programming model. I also think that our experience has been
that "large scale
+1 on getting rid of setCoder; just from a Java SDK perspective, my ideal
world contains PCollections which don't have a user-visible way to mutate
them.
My preference would be to use TypeDescriptors everywhere within Pipeline
construction (where possible), and utilize the CoderRegistry everywhere
On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
wrote:
> On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> wrote:
>>
>> Ken/Robert, I believe users will want the ability to set the output coder
>> because coders may have intrinsic properties where the type isn't enough
>> information to fully spec
On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
wrote:
>
> Ken/Robert, I believe users will want the ability to set the output coder
> because coders may have intrinsic properties where the type isn't enough
> information to fully specify what I want as a user. Some cases I can see
> are:
> 1) I have
Eugene, a two pass model allows for the framework to first gather all the
information the user specified, and then in the second pass you actually
perform the expansion of PTransforms. Since the user isn't adding any
additional data to the pipeline, every transform could inspect all the
attributes
Hi,
That's an interesting thread and I was wondering the relationship between type
descriptor and coder for a while ;)
Today, in a PCollection, we can set the coder and we also have a
getTypeDescriptor(). It sounds weird to me: it should be one or the other.
Basically, if the Coder is not u
I like the ideal of "no coders during construction." I also like the idea
of a second pass to determine coders (this would work particularly well in
Python). However, this assumes a 1:1 relationship between TypeDescriptors
and Coders, which is not accurate in practice. In particular:
* Some types
On Wed, Jul 26, 2017 at 8:58 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method,
> and I wonder how much will break if all call sites of setCoder() will call
> setTypeDescriptor() instead - i.e. how far are we fr
Hmm, yes, I just noticed that PCollection has a setTypeDescriptor() method,
and I wonder how much will break if all call sites of setCoder() will call
setTypeDescriptor() instead - i.e. how far are we from the ideal state of
having a coder inferrable for every sufficiently concrete type descriptor.
+1 but maybe go ever further
On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hello,
>
> I've worked on a few different things recently and ran repeatedly into the
> same issue: that we do not have clear guidance on who should set the Coder
> on a PCollec
Okay, first PR is in review https://github.com/apache/beam/pull/3649
On Wed, Jul 26, 2017 at 11:58 AM Robert Bradshaw
wrote:
> +1, I'm a huge fan of moving this direction. Right now there's also
> the ugliness that setCoder() may be called any number of times before
> a PCollection is used (the
+1, I'm a huge fan of moving this direction. Right now there's also
the ugliness that setCoder() may be called any number of times before
a PCollection is used (the last setter winning) but is an error to
call it once it has been used (and here "used" is not clear--if a
PCollection is returned from
Second that 'it's responsibility of the transform'. For the case when a
PTransform doesn't have enough information(PTransform developer should have
the knowledge), I would prefer a strict way so users won't forget to call
withSomethingCoder(), like
- a Coder is required to new the PTransform;
- or
Hm, can you elaborate? I'm not sure how this relates to my suggestion, the
gist of which is "PTransform's should set the coder on all of their
outputs, and the user should never have to .setCoder() on a PCollection
obtained from a PTransform"
On Wed, Jul 26, 2017 at 7:38 AM Lukasz Cwik
wrote:
>
I'm split between our current one pass model of pipeline construction and a
two pass model where all information is gathered and then PTransform
expansions are performed.
On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Hello,
>
> I've worked on a few di
29 matches
Mail list logo