Repository: beam-site
Updated Branches:
  refs/heads/asf-site 47ad18557 -> acd643afb


Move extension README files to website


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/8ed3f247
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/8ed3f247
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/8ed3f247

Branch: refs/heads/asf-site
Commit: 8ed3f247b8600b7651cffff5186ba145473e97ed
Parents: 47ad185
Author: Ahmet Altay <al...@google.com>
Authored: Tue May 9 17:47:01 2017 -0700
Committer: Ahmet Altay <al...@google.com>
Committed: Tue May 9 21:21:25 2017 -0700

----------------------------------------------------------------------
 src/documentation/runners/flink.md        |  2 +-
 src/documentation/sdks/java-extensions.md | 59 ++++++++++++++++++++++++++
 src/documentation/sdks/java.md            |  8 ++++
 3 files changed, 68 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/runners/flink.md
----------------------------------------------------------------------
diff --git a/src/documentation/runners/flink.md 
b/src/documentation/runners/flink.md
index f844c6b..685917e 100644
--- a/src/documentation/runners/flink.md
+++ b/src/documentation/runners/flink.md
@@ -138,7 +138,7 @@ When executing your pipeline with the Flink Runner, you can 
set these pipeline o
 </tr>
 </table>
 
-See the reference documentation for the  <span 
class="language-java">[FlinkPipelineOptions]({{ site.baseurl 
}}/documentation/sdks/javadoc/{{ site.release_latest 
}}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span
 
class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py)</span>
 interface (and its subinterfaces) for the complete list of pipeline 
configuration options.
+See the reference documentation for the  <span 
class="language-java">[FlinkPipelineOptions]({{ site.baseurl 
}}/documentation/sdks/javadoc/{{ site.release_latest 
}}/index.html?org/apache/beam/runners/flink/FlinkPipelineOptions.html)</span><span
 
class="language-py">[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py)</span>
 interface (and its subinterfaces) for the complete list of pipeline 
configuration options.
 
 ## Additional information and caveats
 

http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java-extensions.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/java-extensions.md 
b/src/documentation/sdks/java-extensions.md
new file mode 100644
index 0000000..a4694af
--- /dev/null
+++ b/src/documentation/sdks/java-extensions.md
@@ -0,0 +1,59 @@
+---
+layout: default
+title: "Beam Java SDK Extensions"
+permalink: /documentation/sdks/java-extensions/
+---
+# Apache Beam Java SDK Extensions
+
+## <a name="join-library"></a>Join-library
+
+Join-library provides inner join, left outer join, and right outer join functions. The aim
+is to reduce the most common join cases to a simple function call.
+
+The functions are generic and support joins of any Beam-supported types.
+The inputs to the join functions are `PCollection`s of `Key` / `Value` pairs. Both
+the left and right `PCollection`s must use the same key type. All the join
+functions return `Key` / `Value` pairs where the key is the join key and the value is
+a `Key` / `Value` pair whose key is the left value and whose value is the right value.
+
+For outer joins, the user must provide a value that represents `null` because 
`null`
+cannot be serialized.
+
+Example usage:
+
+```
+PCollection<KV<String, String>> leftPcollection = ...
+PCollection<KV<String, Long>> rightPcollection = ...
+
+PCollection<KV<String, KV<String, Long>>> joinedPcollection =
+  Join.innerJoin(leftPcollection, rightPcollection);
+```
+
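The output shape described above can be illustrated without Beam. The following is a minimal sketch in plain Java collections, not the Join-library API: the class `JoinShapeSketch` and its methods are hypothetical names, and `Map.Entry` stands in for `KV`. It shows the nested key / (left value, right value) result and how a caller-supplied placeholder takes the place of `null` in an outer join.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Hypothetical illustration only; this is not the Beam Join-library implementation.
final class JoinShapeSketch {

    // Inner join: emit (key, (leftValue, rightValue)) for every pair sharing a key.
    static <K, L, R> List<Entry<K, Entry<L, R>>> innerJoin(
            List<Entry<K, L>> left, List<Entry<K, R>> right) {
        List<Entry<K, Entry<L, R>>> out = new ArrayList<>();
        for (Entry<K, L> l : left) {
            for (Entry<K, R> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    Entry<L, R> pair = new SimpleEntry<>(l.getValue(), r.getValue());
                    out.add(new SimpleEntry<>(l.getKey(), pair));
                }
            }
        }
        return out;
    }

    // Left outer join: unmatched left elements are paired with the caller's
    // placeholder (nullValue) instead of null.
    static <K, L, R> List<Entry<K, Entry<L, R>>> leftOuterJoin(
            List<Entry<K, L>> left, List<Entry<K, R>> right, R nullValue) {
        List<Entry<K, Entry<L, R>>> out = new ArrayList<>();
        for (Entry<K, L> l : left) {
            boolean matched = false;
            for (Entry<K, R> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    Entry<L, R> pair = new SimpleEntry<>(l.getValue(), r.getValue());
                    out.add(new SimpleEntry<>(l.getKey(), pair));
                    matched = true;
                }
            }
            if (!matched) {
                Entry<L, R> pair = new SimpleEntry<>(l.getValue(), nullValue);
                out.add(new SimpleEntry<>(l.getKey(), pair));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> left = new ArrayList<>();
        left.add(new SimpleEntry<>("a", "x"));
        left.add(new SimpleEntry<>("b", "y"));
        List<Entry<String, Long>> right = new ArrayList<>();
        right.add(new SimpleEntry<>("a", 1L));

        // "b" has no match on the right, so it appears only in the outer join,
        // paired with the placeholder -1L.
        System.out.println(innerJoin(left, right).size());
        System.out.println(leftOuterJoin(left, right, -1L).size());
    }
}
```

The nested loops keep the sketch short; they say nothing about how the library performs the join on distributed `PCollection`s.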
+
+## <a name="sorter"></a>Sorter
+
+This module provides the `SortValues` transform, which takes a `PCollection<KV<K, Iterable<KV<K2, V>>>>` and produces a `PCollection<KV<K, Iterable<KV<K2, V>>>>` where, for each primary key `K`, the paired `Iterable<KV<K2, V>>` has been sorted by the byte encoding of the secondary key (`K2`). It is an efficient and scalable sorter for iterables, even when they are too large to fit in memory.
+
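Sorting by byte encoding can differ from the natural ordering of the key type. The sketch below is plain Java, not Beam code, and rests on two assumptions that the text above does not state: that a string key is encoded as its raw UTF-8 bytes, and that encoded keys are compared as unsigned bytes in lexicographic order. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of byte-wise ordering; not part of the sorter module.
final class ByteOrderSketch {

    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // If one array is a prefix of the other, the shorter one sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Returns a sorted copy of the keys, ordered by their UTF-8 bytes (assumed encoding).
    static List<String> sortByUtf8Bytes(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> compareUnsignedBytes(
                x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        // Numeric strings sort by their bytes, not by numeric value.
        System.out.println(sortByUtf8Bytes(Arrays.asList("9", "10", "2"))); // prints [10, 2, 9]
    }
}
```

If numeric order matters for the secondary key, one option under these assumptions is an encoding whose byte order matches it, such as zero-padded strings.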
+### Caveats
+
+* This transform performs value-only sorting; the iterable accompanying each key is sorted, but *there is no relationship between different keys*, as Beam does not support any defined relationship between different elements in a `PCollection`.
+* Each `Iterable<KV<K2, V>>` is sorted on a single worker using local memory and disk. This means that `SortValues` may become a performance and/or scalability bottleneck in some pipelines. For example, users are discouraged from using `SortValues` on a `PCollection` of a single element to globally sort a large `PCollection`. A (rough) estimate of the number of bytes of disk space used if sorting spills to disk is `numRecords * (numSecondaryKeyBytesPerRecord + numValueBytesPerRecord + 16) * 3`.
+
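As a worked instance of the disk-space estimate in the caveat above, the following sketch evaluates the formula for a made-up workload. The record count and byte sizes are hypothetical, as are the class and method names; only the formula itself comes from the text.

```java
// Hypothetical helper that evaluates the rough spill-size formula from the caveat.
final class SortSpillEstimate {

    // numRecords * (numSecondaryKeyBytesPerRecord + numValueBytesPerRecord + 16) * 3
    static long estimateSpillBytes(
            long numRecords, long secondaryKeyBytesPerRecord, long valueBytesPerRecord) {
        return numRecords * (secondaryKeyBytesPerRecord + valueBytesPerRecord + 16) * 3;
    }

    public static void main(String[] args) {
        // Assumed workload: 10 million records, 8-byte secondary keys, 100-byte values.
        long bytes = estimateSpillBytes(10_000_000L, 8, 100);
        System.out.println(bytes); // prints 3720000000
    }
}
```

For this assumed workload the estimate is about 3.7 GB of local disk on the single worker that sorts the iterable.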
+### Options
+
+* To customize the temporary location used when sorting spills to disk, or the maximum amount of memory to use, create a custom instance of `BufferedExternalSorter.Options` and pass it to `SortValues.create`.
+
+### Example usage of `SortValues`
+
+```
+PCollection<KV<String, KV<String, Integer>>> input = ...
+
+// Group by primary key, bringing <SecondaryKey, Value> pairs for the same key together.
+PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped =
+    input.apply(GroupByKey.<String, KV<String, Integer>>create());
+
+// For every primary key, sort the iterable of <SecondaryKey, Value> pairs by secondary key.
+PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
+    grouped.apply(
+        SortValues.<String, String, Integer>create(new BufferedExternalSorter.Options()));
+```

http://git-wip-us.apache.org/repos/asf/beam-site/blob/8ed3f247/src/documentation/sdks/java.md
----------------------------------------------------------------------
diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md
index 3275914..7cfe96f 100644
--- a/src/documentation/sdks/java.md
+++ b/src/documentation/sdks/java.md
@@ -23,3 +23,11 @@ The Java SDK supports all features currently supported by 
the Beam model.
 
 ## Pipeline I/O
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/) page for a list of the currently available I/O 
transforms.
+
+
+## Extensions
+
+The Java SDK has the following extensions:
+
+- [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, left outer join, and right outer join functions.
+- [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an 
efficient and scalable sorter for large iterables.
