Triggering is quite explicitly intended to only control latency/cost tradeoffs, not semantics like this, even though it does have some similar visible effects.
Windowing divides data into groupings that are eventually complete, so it could be used for most of the reasons a key would be used (few runners support this efficiently, so please also use keys), but not for properties as dynamic as the size of a growing file.

Bundling is the unit of commitment, so files larger than a bundle would require deliberate attempts at appending (which we don't do). Triggering does end up controlling what is in a bundle, assuming the runner attempts to fire triggers quickly. Both are too generic to be the right choice for an IO like this.

I think this would be a feature for FileIO. I'd guess you want an intra-bundle splitting of the files. Seems useful to me.

Kenn

On Wed, Jan 16, 2019 at 12:05 PM Jeff Klukas <[email protected]> wrote:

> Related to a previous thread about custom triggering on GlobalWindows [0],
> are there general recommendations for controlling size of output files from
> FileIO.Write?
>
> A general pattern I've seen in systems that need to batch individual
> records to files is that they offer both a maximum file size and a maximum
> latency. If you specify 1 GB and 1 minute respectively, the system would
> create multiple 1 GB files per minute when throughput is high, and a single
> smaller file per minute when throughput is below 1 GB/minute.
>
> From the discussion in [0], it sounds like windowing and triggering
> semantics are not sufficient to provide such guarantees. Bounded runners
> are free to ignore triggers as being non-deterministic. Are there other
> techniques I'm missing to limit file sizes, or is windowing on record
> timestamp the only tool available that applies to both batch and streaming?
>
> [0]
> https://lists.apache.org/thread.html/7b583c73d55d13389a49a35dec2b42128d114361de3c1f0822d9ded4@%3Cuser.beam.apache.org%3E
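
To make the "maximum file size plus maximum latency" pattern from the quoted message concrete, here is a rough sketch in plain Python. It is not Beam code and not how FileIO works today: `RollingBatcher`, its fields, and `demo` are hypothetical names made up for illustration, and each "file" is just an in-memory list of records. It only shows the rolling rule itself: close the current file when the next record would exceed the size cap, or when the file has been open longer than the latency cap.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollingBatcher:
    """Hypothetical helper (not a Beam API): rolls a "file" on size or age."""
    max_bytes: int
    max_age_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    closed: List[List[bytes]] = field(default_factory=list)
    _current: List[bytes] = field(default_factory=list)
    _current_bytes: int = 0
    _opened_at: float = 0.0

    def add(self, record: bytes) -> None:
        # Latency bound: close the open file first if it has been open too long.
        self.flush_if_stale()
        # Size bound: roll before appending if this record would exceed the cap.
        # A single record larger than max_bytes still gets a file of its own.
        if self._current and self._current_bytes + len(record) > self.max_bytes:
            self._close()
        if not self._current:
            self._opened_at = self.clock()
        self._current.append(record)
        self._current_bytes += len(record)

    def flush_if_stale(self) -> None:
        # Intended to be called on a timer as well as on every add.
        if self._current and self.clock() - self._opened_at >= self.max_age_seconds:
            self._close()

    def finish(self) -> None:
        # Close whatever is left, e.g. at the end of a bundle.
        if self._current:
            self._close()

    def _close(self) -> None:
        self.closed.append(self._current)
        self._current = []
        self._current_bytes = 0


def demo() -> List[int]:
    """Five 4-byte records with a 10-byte cap, then a timeout on the remainder."""
    now = [0.0]  # fake clock so the example is deterministic
    batcher = RollingBatcher(max_bytes=10, max_age_seconds=60.0, clock=lambda: now[0])
    for _ in range(5):
        batcher.add(b"abcd")
    now[0] = 61.0
    batcher.flush_if_stale()
    return [len(batch) for batch in batcher.closed]


print(demo())  # [2, 2, 1]
```

With high throughput the size cap produces several full files inside one latency interval; with low throughput the latency cap produces one smaller file per interval. In a real IO the `finish` step would presumably line up with bundle completion, since a bundle is the unit of commitment, but that mapping is an assumption here.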
