Hi Beau, I've answered inline.
On Tue, Feb 21, 2017 at 5:09 PM, Beau Fabry <[email protected]> wrote:

> In times when communicating between outputs would be an optimisation, and
> reducing inputs in parallel would be an optimisation, how do you make the
> call between using a stateful dofn or a Combine.perKey?

(Standard optimization caveat: you'd make the call by measuring them both.)

Usually you will be forced to make this decision not based on optimization,
but based on whether your use case can be expressed naturally as one or the
other. The key difference is the one I talk about in the blog post: the
associative `merge` operation. This enables nice optimizations, so if your
use case fits naturally as a CombineFn, then you should probably use it.
There are exceptions, like fine-grained access to just a piece of your
accumulator state, but the most common way to combine things is with
Combine :-) You can always follow a Combine.perKey with a stateful ParDo
that communicates between outputs, if that also presents a major savings.

> Is there a case to be made that CombineFn could also get access to state
> within extractOutput? It seems like the only benefit to a CombineFn now is
> that the merge and addInput steps can run on multiple workers, is there a
> rule of thumb to know if we have enough data per key that that is
> significant?

This is a big benefit a lot of the time, especially in large batch
executions. Without this optimization opportunity, a Combine.perKey would
execute by first doing a GroupByKey - shuffling all of your data - and then
applying your CombineFn to each KV<K, Iterable<V>>. Instead, this can be
optimized so that the combine happens before the shuffle, which can reduce
the shuffled data to a single accumulator per key. This optimization is
general enough that my description applies to all runners, though it has
lesser benefits in streaming executions, where there is not as much data
per key+window+triggering.
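To make the combiner-lifting idea concrete, here is a small Python sketch (not Beam code; the class names and the per-shard driver are illustrative, mimicking the shape of Beam's CombineFn). Each "worker" pre-reduces its local inputs to one accumulator per key, so only accumulators cross the shuffle, and the associative merge makes the result identical to grouping all values first:

```python
from collections import defaultdict

class MeanFn:
    """Mean combiner with the CombineFn shape; accumulator is (sum, count)."""
    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        s, n = acc
        return (s + value, n + 1)

    def merge_accumulators(self, accs):
        # Associative merge: order and grouping of partial accumulators
        # do not matter, which is what allows pre-shuffle combining.
        return (sum(a[0] for a in accs), sum(a[1] for a in accs))

    def extract_output(self, acc):
        s, n = acc
        return s / n if n else float("nan")

def combine_per_key_lifted(worker_shards, fn):
    """Each shard (worker) combines locally; only one accumulator per key
    per shard is 'shuffled', then accumulators are merged per key."""
    shuffled = defaultdict(list)
    for shard in worker_shards:
        local = {}
        for key, value in shard:
            local[key] = fn.add_input(
                local.get(key, fn.create_accumulator()), value)
        for key, acc in local.items():  # only this crosses the shuffle
            shuffled[key].append(acc)
    return {k: fn.extract_output(fn.merge_accumulators(accs))
            for k, accs in shuffled.items()}

shards = [[("a", 1.0), ("a", 3.0), ("b", 10.0)],
          [("a", 5.0), ("b", 20.0)]]
print(combine_per_key_lifted(shards, MeanFn()))  # {'a': 3.0, 'b': 15.0}
```

Here five input records are reduced to four shuffled accumulators; with more data per key+window the savings grow, which is why the optimization matters most in large batch executions.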
Kenn

> Cheers,
> Beau
>
> On Wed, Feb 15, 2017 at 8:53 AM Ismaël Mejía <[email protected]> wrote:
>
>> Great post, I like the use of the previous figure style with geometric
>> forms and colors, as well as the table analogy that really helps to
>> understand the concepts. I am still digesting some of the consequences of
>> the State API, in particular the implications of using state that you
>> mention at the end. Really good to discuss those also as part of the post.
>>
>> I found some small typos and formatting issues that I addressed here:
>> https://github.com/apache/beam-site/pull/156
>>
>> Thanks for writing,
>> Ismaël
>>
>> On Tue, Feb 14, 2017 at 11:50 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hey Ken
>>
>> Just take a quick look and it's a great post!
>>
>> Thanks
>> Regards
>> JB
>>
>> On Feb 13, 2017, at 18:44, Kenneth Knowles <[email protected]> wrote:
>>
>> Hi all,
>>
>> I've just published a blog post about Beam's new stateful processing
>> capabilities:
>>
>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>
>> The blog post covers stateful processing from a few angles: how it
>> works, how it fits into the Beam model, what you might use it for, and
>> finally some examples of stateful Beam code.
>>
>> I'd love for you to take a look and see how this feature might apply to
>> your use of Beam. And, of course, I'd also love to hear from you about it.
>>
>> Kenn
