This is an automated email from the ASF dual-hosted git repository.

jkff pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 0f5e5e746b98cc6596cb7ab74eb752878e8f153c
Author: Eugene Kirpichov <kirpic...@google.com>
AuthorDate: Thu Aug 3 23:13:07 2017 -0700

    Expands the sections on Coders and Validation in Style Guide
    
    * Provides the new guidance that a transform should set coders on all collections
    * Provides guidance on choosing and inferring coders
    * Recommends less verbose validation error messages
    * Clarifies what validation goes in expand() vs validate()
---
 src/contribute/ptransform-style-guide.md | 73 ++++++++++++++++++++++++--------
 1 file changed, 55 insertions(+), 18 deletions(-)

diff --git a/src/contribute/ptransform-style-guide.md b/src/contribute/ptransform-style-guide.md
index 3624b02..db69c5a 100644
--- a/src/contribute/ptransform-style-guide.md
+++ b/src/contribute/ptransform-style-guide.md
@@ -452,39 +452,76 @@ public static class FooV2 {
 
 #### Validation
 
-* Validate individual parameters in `.withBlah()` methods. Error messages should mention the method being called, the actual value and the range of valid values.
-* Validate inter-parameter invariants in the `PTransform`'s `.validate()` method.
+* Validate individual parameters in `.withBlah()` methods using `checkArgument()`. Error messages should mention the name of the parameter, the actual value, and the range of valid values.
+* Validate parameter combinations and missing required parameters in the `PTransform`'s `.expand()` method.
+* Validate parameters that the `PTransform` takes from `PipelineOptions` in the `PTransform`'s `.validate(PipelineOptions)` method.
+  These validations will be executed when the pipeline is already fully constructed/expanded and is about to be run with a particular `PipelineOptions`.
+  Most `PTransform`s do not use `PipelineOptions` and thus don't need a `validate()` method - instead, they should perform their validation via the two other methods above.
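The validation split introduced above (per-parameter checks in `.withBlah()`, required-parameter and combination checks in `.expand()`) can be sketched without the Beam SDK. In this sketch the local `checkArgument` helper stands in for Guava's `Preconditions.checkArgument`, and the `TwiddleThumbs`-style builder is illustrative only, not the class from the commit:

```java
// Beam-free sketch of the validation pattern: each withX() validates only
// its own parameter; required parameters and parameter combinations are
// checked later (in a real PTransform, inside expand()).
public class ValidationSketch {

  // Stand-in for com.google.common.base.Preconditions.checkArgument.
  static void checkArgument(boolean condition, String template, Object... args) {
    if (!condition) {
      throw new IllegalArgumentException(String.format(template, args));
    }
  }

  // Illustrative immutable builder, loosely modeled on the diff's TwiddleThumbs.
  public static final class TwiddleThumbs {
    final int moo;
    final String boo;

    public TwiddleThumbs(int moo, String boo) {
      this.moo = moo;
      this.boo = boo;
    }

    // Per-parameter validation: name the parameter, the actual value,
    // and the range of valid values.
    public TwiddleThumbs withMoo(int moo) {
      checkArgument(
          moo >= 0 && moo < 100,
          "Moo must be between 0 (inclusive) and 100 (exclusive), but was: %s",
          moo);
      return new TwiddleThumbs(moo, this.boo);
    }

    // Checks that belong in expand(): missing required parameters.
    public void expandTimeChecks() {
      checkArgument(boo != null, "Must specify boo");
    }
  }

  public static void main(String[] args) {
    try {
      new TwiddleThumbs(0, "x").withMoo(150);
    } catch (IllegalArgumentException e) {
      // prints: Moo must be between 0 (inclusive) and 100 (exclusive), but was: 150
      System.out.println(e.getMessage());
    }
  }
}
```

The payoff of this split is that a bad `.withMoo()` call fails immediately at the call site with a message naming the parameter, while a forgotten required parameter fails only once the transform is applied.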
 ```java
 @AutoValue
 public abstract class TwiddleThumbs
     extends PTransform<PCollection<Foo>, PCollection<Bar>> {
   abstract int getMoo();
-  abstract int getBoo();
+  abstract String getBoo();
 
   ...
 
   // Validating individual parameters
   public TwiddleThumbs withMoo(int moo) {
-    checkArgument(moo >= 0 && moo < 100,
-        "TwiddleThumbs.withMoo() called with an invalid moo of %s. "
-        + "Valid values are 0 (exclusive) to 100 (exclusive)",
-        moo);
-    return toBuilder().setMoo(moo).build();
+    checkArgument(
+        moo >= 0 && moo < 100,
+        "Moo must be between 0 (inclusive) and 100 (exclusive), but was: %s",
+        moo);
+    return toBuilder().setMoo(moo).build();
   }
 
-  // Validating cross-parameter invariants
-  public void validate(PCollection<Foo> input) {
-    checkArgument(getMoo() == 0 || getBoo() == 0,
-        "TwiddleThumbs created with both .withMoo(%s) and .withBoo(%s). "
-        + "Only one of these must be specified.",
-        getMoo(), getBoo());
+  public TwiddleThumbs withBoo(String boo) {
+    checkArgument(boo != null, "Boo can not be null");
+    checkArgument(!boo.isEmpty(), "Boo can not be empty");
+    return toBuilder().setBoo(boo).build();
+  }
+
+  @Override
+  public void validate(PipelineOptions options) {
+    int woo = options.as(TwiddleThumbsOptions.class).getWoo();
+    checkArgument(
+        woo > getMoo(),
+        "Woo (%s) must be smaller than moo (%s)",
+        woo, getMoo());
+  }
+
+  @Override
+  public PCollection<Bar> expand(PCollection<Foo> input) {
+    // Validating that a required parameter is present
+    checkArgument(getBoo() != null, "Must specify boo");
+
+    // Validating a combination of parameters
+    checkArgument(
+        getMoo() == 0 || getBoo() == null,
+        "Must specify at most one of moo or boo, but was: moo = %s, boo = %s",
+        getMoo(), getBoo());
+
+    ...
   }
 }
 ```
 
 #### Coders
 
-* Use `Coder`s only for setting the coder on a `PCollection` or a mutable state cell.
-* When available, use a specific most efficient coder for the datatype (e.g.
`StringUtf8Coder.of()` for strings, `ByteArrayCoder.of()` for byte arrays, etc.), rather than using a generic coder like `SerializableCoder`. Develop efficient coders for types that can be elements of `PCollection`s.
-* Do not use coders as a general serialization or parsing mechanism for arbitrary raw byte data. (anti-examples that should be fixed: `TextIO`, `KafkaIO`).
-* In general, any transform that outputs a user-controlled type (that is not its input type) needs to accept a coder in the transform configuration (example: the `Create.of()` transform). This gives the user the ability to control the coder no matter how the transform is structured: e.g., purely letting the user specify the coder on the output `PCollection` of the transform is insufficient in case the transform internally uses intermediate `PCollection`s of this type.
+`Coder`s are a way for a Beam runner to materialize intermediate data or transmit it between workers when necessary. `Coder` should not be used as a general-purpose API for parsing or writing binary formats because the particular binary encoding of a `Coder` is intended to be its private implementation detail.
+
+##### Providing default coders for types
+
+Provide default `Coder`s for all new data types. Use `@DefaultCoder` annotations or `CoderProviderRegistrar` classes annotated with `@AutoService`: see usages of these classes in the SDK for examples. If performance is not important, you can use `SerializableCoder` or `AvroCoder`. Otherwise, develop an efficient custom coder (subclass `AtomicCoder` for concrete types, `StructuredCoder` for generic types).
+
+##### Setting coders on output collections
+
+All `PCollection`s created by your `PTransform` (both output and intermediate collections) must have a `Coder` set on them: a user should never need to call `.setCoder()` to "fix up" a coder on a `PCollection` produced by your `PTransform` (in fact, Beam intends to eventually deprecate `setCoder`).
+In some cases, coder inference will be sufficient to achieve this; in other cases, your transform will need to explicitly call `setCoder` on its collections.
+
+If the collection is of a concrete type, that type usually has a corresponding coder. Use a specific most efficient coder (e.g. `StringUtf8Coder.of()` for strings, `ByteArrayCoder.of()` for byte arrays, etc.), rather than a general-purpose coder like `SerializableCoder`.
+
+If the type of the collection involves generic type variables, the situation is more complex:
+* If it coincides with the transform's input type or is a simple wrapper over it, you can reuse the coder of the input `PCollection`, available via `input.getCoder()`.
+* Attempt to infer the coder via `input.getPipeline().getCoderRegistry().getCoder(TypeDescriptor)`. Use utilities in `TypeDescriptors` to obtain the `TypeDescriptor` for the generic type. For an example of this approach, see the implementation of `AvroIO.parseGenericRecords()`. However, coder inference for generic types is best-effort and in some cases it may fail due to Java type erasure.
+* Always make it possible for the user to explicitly specify a `Coder` for the relevant type variable(s) as a configuration parameter of your `PTransform`. (e.g. `AvroIO.<T>parseGenericRecords().withCoder(Coder<T>)`). Fall back to inference if the coder was not explicitly specified.
+

-- 
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" <commits@beam.apache.org>.
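The coder guidance in the diff recommends an efficient custom coder over `SerializableCoder` for new element types. Beam's real `AtomicCoder` base class is not reproduced here; the following Beam-free sketch, with a hypothetical `Foo` element type and field layout, shows the typical encode/decode shape such a coder implements: a length-prefixed UTF-8 string followed by a fixed-width int:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Beam-free sketch of a deterministic, compact per-type encoding of the kind
// a custom AtomicCoder subclass would implement for a concrete element type.
public class FooCoderSketch {

  // Hypothetical element type.
  public static final class Foo {
    final String name;
    final int count;

    public Foo(String name, int count) {
      this.name = name;
      this.count = count;
    }
  }

  // encode: length-prefixed UTF-8 string, then a fixed-width int.
  public static void encode(Foo value, OutputStream out) throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    byte[] nameBytes = value.name.getBytes(StandardCharsets.UTF_8);
    data.writeInt(nameBytes.length);
    data.write(nameBytes);
    data.writeInt(value.count);
    data.flush();
  }

  // decode: the exact mirror of encode, field by field.
  public static Foo decode(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    byte[] nameBytes = new byte[data.readInt()];
    data.readFully(nameBytes);
    return new Foo(new String(nameBytes, StandardCharsets.UTF_8), data.readInt());
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    encode(new Foo("thumbs", 42), buf);
    Foo roundTripped = decode(new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(roundTripped.name + ":" + roundTripped.count); // prints thumbs:42
  }
}
```

Compared with `SerializableCoder`, this hand-written layout avoids Java serialization's class-descriptor overhead and gives a stable, compact byte representation, which is the property the style guide is after.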