Re: New blog post: "Stateful processing with Apache Beam"

Beau Fabry Tue, 21 Feb 2017 17:10:22 -0800

Can someone just confirm my understanding of when to use the various KV
combination methods:


  * GroupByKey/CoGroupByKey -- use if no incremental reduction of your
inputs is possible, and no useful information can be communicated between
outputs
  * Combine.perKey -- use if you can reduce your inputs in parallel, but no
useful information can be communicated between extractOutput calls
  * Stateful DoFn -- use if you cannot reduce your inputs in parallel, and
need to communicate information between outputs
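
To make the three shapes concrete, here is a minimal plain-Python sketch of
how I picture them. This is not the Beam API: group_by_key, SumFn,
combine_per_key and new_maxima are names I made up for illustration, and the
single-process loops only stand in for what a runner would do.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Model of GroupByKey: every value for a key is collected before use."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


class SumFn:
    """Model of a CombineFn: an associative, commutative reduction."""

    def create_accumulator(self):
        return 0

    def add_input(self, acc, value):
        return acc + value

    def merge_accumulators(self, accs):
        return sum(accs)

    def extract_output(self, acc):
        return acc


def combine_per_key(pairs, fn):
    """Model of Combine.perKey: inputs are folded into one accumulator per key."""
    accs = {}
    for key, value in pairs:
        acc = accs.get(key, fn.create_accumulator())
        accs[key] = fn.add_input(acc, value)
    return {key: fn.extract_output(acc) for key, acc in accs.items()}


def new_maxima(pairs):
    """Model of a stateful DoFn: emit a value only if it beats the largest
    value already seen for its key, so each output depends on earlier ones."""
    state = {}  # hypothetical per-key state cell
    out = []
    for key, value in pairs:
        if key not in state or value > state[key]:
            state[key] = value
            out.append((key, value))
    return out


pairs = [("a", 3), ("b", 1), ("a", 5), ("a", 4), ("b", 2)]
grouped = group_by_key(pairs)
summed = combine_per_key(pairs, SumFn())
maxima = new_maxima(pairs)
```

The third function is the only one where what gets emitted for an element
depends on what was emitted before it, which is the "communication between
outputs" I mean above.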

When both communicating between outputs and reducing inputs in parallel
would be optimisations, how do you make the call between a stateful DoFn
and a Combine.perKey? Is there a case to be made that a CombineFn could
also get access to state within extractOutput? It seems like the only
benefit of a CombineFn now is that the merge and addInput steps can run on
multiple workers. Is there a rule of thumb to know whether we have enough
data per key for that to be significant?
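
For reference, this is the benefit I have in mind, again as a plain-Python
sketch rather than real Beam code. MeanFn, combine_bundle and
combine_in_parallel are hypothetical names, and the four "bundles" are an
assumption standing in for work done on separate workers. It does not answer
the rule-of-thumb question; it only shows what is being traded.

```python
class MeanFn:
    """Model of a CombineFn whose accumulator is a (sum, count) pair."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        total = sum(t for t, _ in accs)
        count = sum(c for _, c in accs)
        return (total, count)

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def combine_bundle(fn, values):
    """Fold one bundle of values into a single accumulator (addInput step)."""
    acc = fn.create_accumulator()
    for value in values:
        acc = fn.add_input(acc, value)
    return acc


def combine_in_parallel(fn, bundles):
    """Pre-combine each bundle independently, then merge the partial results.

    Returns the final output and the number of accumulators that had to be
    brought together, which is one per bundle rather than one per element.
    """
    partials = [combine_bundle(fn, bundle) for bundle in bundles]
    merged = fn.merge_accumulators(partials)
    return fn.extract_output(merged), len(partials)


values = list(range(1, 101))  # 100 values for a single key
bundles = [values[i:i + 25] for i in range(0, len(values), 25)]

mean, shuffled = combine_in_parallel(MeanFn(), bundles)
sequential = MeanFn().extract_output(combine_bundle(MeanFn(), values))
```

Both paths give the same mean, but the parallel one only has to bring
together one accumulator per bundle instead of every element for the key.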

Cheers,
Beau

On Wed, Feb 15, 2017 at 8:53 AM Ismaël Mejía wrote:

> Great post, I like the use of the previous figure style with geometric
> forms and colors, as well as the table analogy that really helps to
> understand the concepts. I am still digesting some of the consequences of
> the State API, in particular the implications of using state that you
> mention at the end. Really good to discuss those also as part of the post.
>
> I found some small typos and formatting issues, which I addressed here:
> https://github.com/apache/beam-site/pull/156
>
> Thanks for writing,
> Ismaël
>
>
> On Tue, Feb 14, 2017 at 11:50 AM, Jean-Baptiste Onofré wrote:
>
> Hey Ken
>
> Just took a quick look and it's a great post!
>
> Thanks
> Regards
> JB
> On Feb 13, 2017, at 18:44, Kenneth Knowles wrote:
>
> Hi all,
>
> I've just published a blog post about Beam's new stateful processing
> capabilities:
>
>     https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>
> The blog post covers stateful processing from a few angles: how it works,
> how it fits into the Beam model, what you might use it for, and finally
> some examples of stateful Beam code.
>
> I'd love for you to take a look and see how this feature might apply to
> your use of Beam. And, of course, I'd also love to hear from you about it.
>
> Kenn
>
>
>
