I'm looking for any information anyone can provide on the strategy hinted
at here:
http://stackoverflow.com/questions/38775173/can-a-once-firing-trigger-be-used-to-reduce-data-volume
namely, using CombinePerKey as a poor man's state API. The only approach I
can think of is mutating the AccumT object inside the extractOutput method,
but that feels a little dangerous, and I want to confirm that I won't get
any surprises.
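
To make the question concrete, here is roughly what I have in mind: a
minimal sketch assuming a simple running sum, with hypothetical names
(ChangeOnlySumFn, SumAccum), where a null output stands in for "suppress
this firing" (the real pipeline would need a NullableCoder on the output
and a downstream Filter to drop the nulls):

import java.io.Serializable;
import java.util.Objects;
import org.apache.beam.sdk.transforms.Combine.CombineFn;

// Sketch only -- I haven't verified this against the runner.
public class ChangeOnlySumFn
    extends CombineFn<Long, ChangeOnlySumFn.SumAccum, Long> {

  public static class SumAccum implements Serializable {
    long sum = 0;
    Long lastEmitted = null; // mutated inside extractOutput -- the risky part
  }

  @Override
  public SumAccum createAccumulator() {
    return new SumAccum();
  }

  @Override
  public SumAccum addInput(SumAccum accum, Long input) {
    accum.sum += input;
    return accum;
  }

  @Override
  public SumAccum mergeAccumulators(Iterable<SumAccum> accums) {
    SumAccum merged = new SumAccum();
    for (SumAccum a : accums) {
      merged.sum += a.sum;
      if (a.lastEmitted != null) {
        merged.lastEmitted = a.lastEmitted;
      }
    }
    return merged;
  }

  @Override
  public Long extractOutput(SumAccum accum) {
    // If this firing would re-emit the same value, output null so it can be
    // filtered out downstream. The mutation below is the "state" -- as far
    // as I can tell there is no documented guarantee it survives between
    // pane firings, which is exactly what I'd like confirmed.
    if (Objects.equals(accum.lastEmitted, accum.sum)) {
      return null;
    }
    accum.lastEmitted = accum.sum;
    return accum.sum;
  }
}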

The issue is that our Dataflow job consumes the binlog of a database, and
most of the update events don't touch any field that would affect the
calculation. Most of the aggregation points use a global window that
triggers on each new element, so our output is currently correct, but we
are emitting updates many orders of magnitude more often than required.
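
For reference, the aggregation points look roughly like this today (again
a sketch: binlogEvents and the Sum combine are stand-ins for our actual
inputs and combiners):

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class CurrentAggregation {
  // Every element re-fires the global window, so each binlog event produces
  // a new output pane even when no field relevant to the sum changed.
  static PCollection<KV<String, Long>> perKeyTotals(
      PCollection<KV<String, Long>> binlogEvents) {
    return binlogEvents
        .apply(
            Window.<KV<String, Long>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .accumulatingFiredPanes()
                .withAllowedLateness(Duration.ZERO))
        .apply(Combine.perKey(Sum.ofLongs()));
  }
}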
