Hi Robert, Kenneth. Thanks a lot to both of you for your responses!!
Kenneth, unfortunately I'm not sure we're experienced enough with Apache Beam to get anywhere close to your suggestion, but thanks anyway!!

Robert, your suggestion sounds great to me. Could you please provide an example of how to use that 'metadata-driven' trigger?

Thanks!

On Tue, Jan 9, 2018 at 9:11 PM Kenneth Knowles <[email protected]> wrote:

> Often, when you need or want more control than triggers provide, such as
> input-type-specific logic like yours, you can use state and timers in
> ParDo to control when to output. You lose any potential optimizations of
> Combine based on associativity/commutativity and assume the burden of
> making sure your output is sensible, but dropping to low-level stateful
> computation may be your best bet.
>
> Kenn
>
> On Tue, Jan 9, 2018 at 11:59 AM, Robert Bradshaw <[email protected]> wrote:
>
>> We've tossed around the idea of "metadata-driven" triggers which would
>> essentially let you provide a mapping element -> metadata and a
>> monotonic CombineFn metadata* -> bool that would allow for this (the
>> AfterCount being a special case of this, with the mapping fn being _
>> -> 1, and the CombineFn being sum(...) >= N, for size one would
>> provide a (perhaps approximate) sizing mapping fn).
>>
>> Note, however, that there's no guarantee that the trigger fire as soon
>> as possible; due to runtime characteristics a significant amount of
>> data may be buffered (or come in at once) before the trigger is
>> queried. One possibility would be to follow your triggering with a
>> DoFn that breaks up large value streams into multiple manageable sized
>> ones as needed.
>>
>> On Tue, Jan 9, 2018 at 11:43 AM, Carlos Alonso <[email protected]> wrote:
>>
>> > Hi everyone!!
>> >
>> > I was wondering if there is an option to trigger window panes based
>> > on the size of the pane itself (rather than the number of elements).
>> >
>> > To provide a little bit more of context we're backing up a PubSub
>> > topic into GCS with the "special" feature that, depending on the
>> > "type" of the message, the GCS destination is one or another.
>> >
>> > Messages' 'shape' published there is quite random, some of them are
>> > very frequent and small, some others very big but sparse... We have
>> > around 150 messages per second (in total) and we're firing every 15
>> > minutes and experiencing OOM errors, we've considered firing based on
>> > the number of items as well, but given the randomness of the input, I
>> > don't think it will be a final solution either.
>> >
>> > Having a trigger based on size would be great, another option would
>> > be to have a dynamic shards number for the PTransform that actually
>> > writes the files.
>> >
>> > What is your recommendation for this use case?
>> >
>> > Thanks!!
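
For reference, here is a minimal sketch of the splitting step Robert describes (breaking a large group of values into manageable, size-bounded pieces after the trigger fires). It is plain Python rather than Beam code, and the name `split_by_size`, the `max_bytes` parameter, and the use of `len()` as the sizing function are all assumptions made for illustration; they are not part of any Beam API. It also mirrors the "sum of sizes >= N" idea from the metadata-driven trigger discussion, applied as a flush condition.

```python
from typing import Iterable, Iterator, List


def split_by_size(messages: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Greedily group messages into batches of at most max_bytes each.

    A single message larger than max_bytes is emitted alone in its own batch.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")

    batch: List[bytes] = []
    batch_size = 0
    for msg in messages:
        size = len(msg)  # hypothetical sizing function; could be approximate
        # Flush first if adding this message would push the batch over the limit.
        if batch and batch_size + size > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(msg)
        batch_size += size

    # Emit whatever is left once the input is exhausted.
    if batch:
        yield batch
```

In a pipeline, logic like this would presumably live inside a DoFn applied to each grouped value, so that each emitted batch can be written as its own file. The batches are bounded by bytes rather than element count, which is the property asked for in the original question.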
