[1/3] beam-site git commit: Added post for 0.4.0 release with Apex runner addition.

davor Mon, 09 Jan 2017 17:39:07 -0800

Repository: beam-site
Updated Branches:
  refs/heads/asf-site fe57db99a -> 5de55f266



Added post for 0.4.0 release with Apex runner addition.


Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/430a84c3
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/430a84c3
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/430a84c3

Branch: refs/heads/asf-site
Commit: 430a84c3ae9689be7ba3ba4e3b5ca83775118e1e
Parents: fe57db9
Author: Thomas Weise <t...@apache.org>
Authored: Sat Dec 31 10:10:28 2016 -0800
Committer: Davor Bonaci <da...@google.com>
Committed: Mon Jan 9 17:35:11 2017 -0800

----------------------------------------------------------------------
 src/_data/authors.yml                      |  4 +++
 src/_posts/2017-01-09-added-apex-runner.md | 39 +++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/430a84c3/src/_data/authors.yml
----------------------------------------------------------------------
diff --git a/src/_data/authors.yml b/src/_data/authors.yml
index ffcb419..147dd7b 100644
--- a/src/_data/authors.yml
+++ b/src/_data/authors.yml
@@ -32,3 +32,7 @@ tgroh:
 jesseanderson:
     name: Jesse Anderson
     twitter: jessetanderson
+thw:
+    name: Thomas Weise
+    email: t...@apache.org
+    twitter: thweise

http://git-wip-us.apache.org/repos/asf/beam-site/blob/430a84c3/src/_posts/2017-01-09-added-apex-runner.md
----------------------------------------------------------------------
diff --git a/src/_posts/2017-01-09-added-apex-runner.md 
b/src/_posts/2017-01-09-added-apex-runner.md
new file mode 100644
index 0000000..93c00ed
--- /dev/null
+++ b/src/_posts/2017-01-09-added-apex-runner.md
@@ -0,0 +1,39 @@
+---
+layout: post
+title:  "Release 0.4.0 adds a runner for Apache Apex"
+date:   2017-01-09 00:00:01 -0700
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - thw
+---
+
+The latest release 0.4.0 of [Apache Beam](https://beam.apache.org) adds a new 
runner for [Apache Apex](http://apex.apache.org/). We are excited to reach this 
initial milestone and are looking forward to continued collaboration between 
the Beam and Apex communities to advance the runner.
+
+<!--more-->
+
+Beam evolved from the Google Dataflow SDK and, as an incubator project, has 
quickly adopted the Apache way, grown its community, and attracted increasing 
interest from users who hope to benefit from a conceptually strong, unified 
programming model that is portable between different big data processing 
frameworks (see 
[Streaming-101](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
 and 
[Streaming-102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)).
 Multiple Apache projects already provide runners for Beam (see [runners and 
capabilities 
matrix](http://beam.apache.org/documentation/runners/capability-matrix/)).
+
+Apex is a stream processing framework for low-latency, high-throughput, 
stateful and reliable processing of complex analytics pipelines on clusters. 
Apex has been in development since 2012 and is used in production by large 
companies for real-time and batch processing at scale.
+
+The initial revision of the runner was focused on broad coverage of the Beam 
model on a functional level. That means there will be follow-up work in 
several areas to take the runner from functional to scalable and 
high-performance, matching the capabilities of Apex and its native API. The 
runner capabilities matrix shows that the Apex capabilities are well aligned 
with the Beam model. Specifically, the ability to track computational state in 
a fault-tolerant and efficient manner is needed to broadly support the 
windowing concepts, including event-time-based processing.
+
+## Stateful Stream Processor
+
+Apex was built as a stateful stream processor from the ground up. Operators 
[checkpoint](https://www.datatorrent.com/blog/blog-introduction-to-checkpoint/) 
state in a distributed and asynchronous manner that produces a consistent 
snapshot for the entire processing graph, which can be used for recovery. Apex 
also supports such recovery in an incremental, or fine-grained, manner. This 
means only the portion of the DAG that is actually affected by a failure will 
be recovered while the remaining pipeline continues processing (this can be 
leveraged to implement use cases with special needs, such as speculative 
execution to meet an SLA on the processing latency). State checkpointing, 
together with the idempotent processing guarantee, is the basis for 
[exactly-once 
results](https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/)
 support in Apex.
+
+## Translation to Apex DAG
+
+A Beam runner needs to implement the translation from the Beam model to the 
underlying framework's execution model. In the case of Apex, the runner will 
translate the pipeline into the [native (compositional, low level) DAG 
API](https://www.datatorrent.com/blog/tracing-dags-from-specification-to-execution/)
 (which is also the basis for a number of other APIs that are available to 
specify applications that run on Apex). The DAG consists of operators 
(functional building blocks) that are connected with streams. The runner 
provides the execution layer. In the case of Apex, that is distributed stream 
processing: operators process data event by event. The minimum set of 
operators covers Beam's primitive transforms: `ParDo.Bound`, 
`ParDo.BoundMulti`, `Read.Unbounded`, `Read.Bounded`, `GroupByKey`, 
`Flatten.FlattenPCollectionList` etc.
+
+## Execution and Testing
+
+In this release, the Apex runner executes the pipelines in embedded mode, 
where, similar to the direct runner, everything is executed in a single JVM. 
See the [quickstart](https://beam.apache.org/get-started/quickstart/) for how 
to run the Beam examples with the Apex runner.
+
+Embedded mode is useful for development and debugging. Apex in production runs 
distributed on Apache Hadoop YARN clusters. An example of how a Beam pipeline 
can be embedded into an Apex application package to run on YARN can be found 
[here](https://github.com/tweise/apex-samples/tree/master/beam-apex-wordcount), 
and support for direct launch in the runner is currently being worked on. 
+
+The Beam project has a strong focus on development process and tooling, 
including testing. For the runners, there is a comprehensive test suite with 
more than 200 integration tests that are executed against each runner to ensure 
they don't break as changes are made. The tests cover the capabilities of the 
matrix and thus are a measure of completeness and correctness of the runner 
implementations. The suite was very helpful when developing the Apex runner. 
+
+## Outlook
+
+The next step is to take the Apex runner from functional to ready for real 
applications that run distributed, leveraging the scalability and performance 
features of Apex, similar to its native API. This includes chaining of ParDos, 
partitioning, optimizing combine operations etc. To get involved, please see 
[JIRA](https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20runner-apex%20and%20resolution%20%3D%20unresolved)
 and join the Beam community.
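
As an illustration of the "Translation to Apex DAG" section of the post above, here is a minimal Python sketch of a DAG of operators connected by streams, with data processed event by event. It is a toy model only, not the Apex or Beam API: every class and function name (`Operator`, `ParDoOperator`, `GroupByKeyOperator`, `CollectOperator`, `run_word_count`) is hypothetical, and grouped state is flushed at end of input, where a real runner would rely on windows and triggers.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List


class Operator:
    """Toy operator: receives one event at a time and forwards results downstream."""

    def __init__(self) -> None:
        self.downstream: List["Operator"] = []

    def connect(self, other: "Operator") -> "Operator":
        # A "stream" is modeled simply as an edge to the next operator.
        self.downstream.append(other)
        return other

    def emit(self, event: Any) -> None:
        for op in self.downstream:
            op.process(event)

    def process(self, event: Any) -> None:
        raise NotImplementedError

    def end_of_input(self) -> None:
        # Propagate the end-of-input signal through the DAG.
        for op in self.downstream:
            op.end_of_input()


class ParDoOperator(Operator):
    """Applies a function to each event; the function may yield 0..n outputs."""

    def __init__(self, fn: Callable[[Any], Iterable[Any]]) -> None:
        super().__init__()
        self.fn = fn

    def process(self, event: Any) -> None:
        for out in self.fn(event):
            self.emit(out)


class GroupByKeyOperator(Operator):
    """Buffers (key, value) events in keyed state and emits groups at end of input."""

    def __init__(self) -> None:
        super().__init__()
        # Assumption: in a real stateful runner this state would be checkpointed.
        self.state = defaultdict(list)

    def process(self, event: Any) -> None:
        key, value = event
        self.state[key].append(value)

    def end_of_input(self) -> None:
        for key in sorted(self.state):
            self.emit((key, self.state[key]))
        self.state.clear()
        super().end_of_input()


class CollectOperator(Operator):
    """Sink that records every event it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Any] = []

    def process(self, event: Any) -> None:
        self.results.append(event)


def run_word_count(lines: Iterable[str]) -> List[Any]:
    # Build the DAG: split -> group by key -> sum -> collect.
    source = ParDoOperator(lambda line: ((word, 1) for word in line.split()))
    grouped = source.connect(GroupByKeyOperator())
    summed = grouped.connect(ParDoOperator(lambda kv: [(kv[0], sum(kv[1]))]))
    sink = summed.connect(CollectOperator())
    # Feed the source one event at a time, then signal end of input.
    for line in lines:
        source.process(line)
    source.end_of_input()
    return sink.results
```

Calling `run_word_count(["a b a", "b c"])` returns the counts `a: 2`, `b: 2`, `c: 1` as a list of `(word, count)` tuples in key order.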
