Repository: beam-site Updated Branches: refs/heads/asf-site 47ad18557 -> acd643afb
Move extension README files to website Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/8ed3f247 Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/8ed3f247 Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/8ed3f247 Branch: refs/heads/asf-site Commit: 8ed3f247b8600b7651cffff5186ba145473e97ed Parents: 47ad185 Author: Ahmet Altay <al...@google.com> Authored: Tue May 9 17:47:01 2017 -0700 Committer: Ahmet Altay <al...@google.com> Committed: Tue May 9 21:21:25 2017 -0700 ---------------------------------------------------------------------- src/documentation/runners/flink.md | 2 +- src/documentation/sdks/java-extensions.md | 59 ++++++++++++++++++++++++++ src/documentation/sdks/java.md | 8 ++++ 3 files changed, 68 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/runners/flink.md ---------------------------------------------------------------------- diff --git a/src/documentation/runners/flink.md b/src/documentation/runners/flink.md index f844c6b..685917e 100644 --- a/src/documentation/runners/flink.md +++ b/src/documentation/runners/flink.md @@ -138,7 +138,7 @@ When executing your pipeline with the Flink Runner, you can set these pipeline o </tr> </table> -See the reference documentation for the <span class="language-java">[FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options. +See the reference documentation for the <span class="language-java">[FlinkPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py)</span> interface (and its subinterfaces) for the complete list of pipeline configuration options. ## Additional information and caveats http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java-extensions.md ---------------------------------------------------------------------- diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md new file mode 100644 index 0000000..a4694af --- /dev/null +++ b/src/documentation/sdks/java-extensions.md @@ -0,0 +1,59 @@ +--- +layout: default +title: "Beam Java SDK Extensions" +permalink: /documentation/sdks/java-extensions/ +--- +# Apache Beam Java SDK Extensions + +## <a name="join-library"></a>Join-library + +Join-library provides inner join, outer left join, and outer right join functions. The aim +is to simplify the most common cases of join to a simple function call. + +The functions are generic and support joins of any Beam-supported types. +Input to the join functions are `PCollections` of `Key` / `Value`s. Both +the left and right `PCollection`s need the same type for the key. All the join +functions return a `Key` / `Value` where `Key` is the join key and value is +a `Key` / `Value` where the key is the left value and right is the value. + +For outer joins, the user must provide a value that represents `null` because `null` +cannot be serialized. + +Example usage: + +``` +PCollection<KV<String, String>> leftPcollection = ... +PCollection<KV<String, Long>> rightPcollection = ... + +PCollection<KV<String, KV<String, Long>>> joinedPcollection = + Join.innerJoin(leftPcollection, rightPcollection); +``` + + +## <a name="sorter"></a>Sorter + +This module provides the `SortValues` transform, which takes a `PCollection<KV<K, Iterable<KV<K2, V>>>>` and produces a `PCollection<KV<K, Iterable<KV<K2, V>>>>` where, for each primary key `K` the paired `Iterable<KV<K2, V>>` has been sorted by the byte encoding of secondary key (`K2`). It is an efficient and scalable sorter for iterables, even if they are large (do not fit in memory). + +### Caveats + +- This transform performs value-only sorting; the iterable accompanying each key is sorted, but *there is no relationship between different keys*, as Beam does not support any defined relationship between different elements in a `PCollection`. +* Each `Iterable<KV<K2, V>>` is sorted on a single worker using local memory and disk. This means that `SortValues` may be a performance and/or scalability bottleneck when used in different pipelines. For example, users are discouraged from using `SortValues` on a `PCollection` of a single element to globally sort a large `PCollection`. A (rough) estimate of the number of bytes of disk space utilized if sorting spills to disk is `numRecords * (numSecondaryKeyBytesPerRecord + numValueBytesPerRecord + 16) * 3`. + +### Options + +* The user can customize the temporary location used if sorting requires spilling to disk and the maximum amount of memory to use by creating a custom instance of `BufferedExternalSorter.Options` to pass into `SortValues.create`. + +### Example usage of `SortValues` + +``` +PCollection<KV<String, KV<String, Integer>>> input = ... + +// Group by primary key, bringing <SecondaryKey, Value> pairs for the same key together. +PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped = + input.apply(GroupByKey.<String, KV<String, Integer>>create()); + +// For every primary key, sort the iterable of <SecondaryKey, Value> pairs by secondary key. +PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted = + grouped.apply( + SortValues.<String, String, Integer>create(new BufferedExternalSorter.Options())); +``` http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java.md ---------------------------------------------------------------------- diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md index 3275914..7cfe96f 100644 --- a/src/documentation/sdks/java.md +++ b/src/documentation/sdks/java.md @@ -23,3 +23,11 @@ The Java SDK supports all features currently supported by the Beam model. ## Pipeline I/O See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built-in/) page for a list of the currently available I/O transforms. + + +## Extensions + +The Java SDK has the following extensions: + +- [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions. +- [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.