I have a question about this. My scenario is: * PubSub input with a timestampAttribute named doc_timestamp * Fixed windowing of one hour size. * Keys are an internal attribute of the messages (the type) * Messages of one particular type are way more frequent than the others, so it is likely a hot key
Will it help if I add a string representation of the doc_timestamp (to the hour) to the key, in order to increase the range of keys and therefore make it more parallelisable? I wonder if it will help or not, as the final result will be the same (type + window), but not sure if it would help before the point where the windowing is applied. Thanks! On Mon, Feb 12, 2018 at 6:41 PM Carlos Alonso <[email protected]> wrote: > Ok, that makes a lot of sense. Thanks Kenneth! > > On Mon, Feb 12, 2018 at 5:41 PM Kenneth Knowles <[email protected]> wrote: > >> Hi Carlos, >> >> You are quite correct that choosing the keys is important for work to be >> evenly distributed. The reason you need to have a KvCoder is that state is >> partitioned per key (to give natural & automatic parallelism) and window >> (to allow reclaiming expired state so you can process unbounded data with >> bounded storage, and also more parallelism). To a Beam runner, most data in >> the pipeline is "just bytes" that it cannot interpret. KvCoder is a special >> case where a runner knows the binary layout of encoded data so it can pull >> out the keys in order to shuffle data of the same key to the same place, so >> that is why it has to be a KvCoder. >> >> Kenn >> >> On Mon, Feb 12, 2018 at 5:52 AM, Carlos Alonso <[email protected]> >> wrote: >> >>> I was refactoring my solution a bit and tried to make my stateful >>> transform to work on simple case classes and I got this exception: >>> https://pastebin.com/x4xADmvL . I'd like to understand the rationale >>> behind this as I think carefully choosing the keys would be very important >>> in order for the work to be properly distributed. >>> >>> Thanks! >>> >> >>
