Triggering is explicitly intended to control only latency/cost tradeoffs,
not semantics like this, even though it has some similar visible effects.

Windowing divides the data into groupings that are eventually completed, so
it could be used for most of the reasons a key would be used (few runners
support this efficiently, so please also use keys), but not for properties
as dynamic as the size of a growing file.

Bundling is the unit of commitment, so a file larger than a bundle would
require deliberately appending to it (which we don't do). Triggering does
end up controlling what is in a bundle, assuming the runner attempts to
fire triggers quickly.

Both are too generic to be the right choice for an IO like this. I think
this would be a feature for FileIO. I'd guess you want intra-bundle
splitting of the files. That seems useful to me.
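
To make the intra-bundle splitting idea concrete, here is a minimal sketch
in plain Python. It is not Beam or FileIO code: split_bundle_by_size and
max_file_bytes are hypothetical names, and it assumes records are already
encoded as bytes and that one call sees the records of a single bundle.

```python
from typing import Iterable, Iterator, List


def split_bundle_by_size(
    records: Iterable[bytes], max_file_bytes: int
) -> Iterator[List[bytes]]:
    """Group one bundle's records into chunks of at most max_file_bytes.

    Hypothetical helper, not part of any Beam API. A single record that is
    larger than the limit still gets a chunk of its own.
    """
    if max_file_bytes <= 0:
        raise ValueError("max_file_bytes must be positive")

    chunk: List[bytes] = []
    chunk_bytes = 0
    for record in records:
        # Close the current chunk before it would exceed the size limit.
        if chunk and chunk_bytes + len(record) > max_file_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(record)
        chunk_bytes += len(record)

    # Emit whatever is left when the bundle ends.
    if chunk:
        yield chunk
```

Each yielded chunk would correspond to one output file. Because a chunk
never spans bundles, nothing has to be appended to a file after the bundle
is committed, but the last file of each bundle can be smaller than the
limit.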

Kenn

On Wed, Jan 16, 2019 at 12:05 PM Jeff Klukas <[email protected]> wrote:

> Related to a previous thread about custom triggering on GlobalWindows [0],
> are there general recommendations for controlling size of output files from
> FileIO.Write?
>
> A general pattern I've seen in systems that need to batch individual
> records to files is that they offer both a maximum file size and a maximum
> latency. If you specify 1 GB and 1 minute respectively, the system would
> create multiple 1 GB files per minute when throughput is high, and a single
> smaller file per minute when throughput is below 1 GB/minute.
>
> From the discussion in [0], it sounds like windowing and triggering
> semantics are not sufficient to provide such guarantees. Bounded runners
> are free to ignore triggers as being non-deterministic. Are there other
> techniques I'm missing to limit file sizes, or is windowing on record
> timestamp the only tool available that applies to both batch and streaming?
>
> [0]
> https://lists.apache.org/thread.html/7b583c73d55d13389a49a35dec2b42128d114361de3c1f0822d9ded4@%3Cuser.beam.apache.org%3E
>
