Ok, that makes a lot of sense. Thanks Kenneth!

On Mon, Feb 12, 2018 at 5:41 PM Kenneth Knowles <[email protected]> wrote:
> Hi Carlos,
>
> You are quite correct that choosing the keys is important for work to be
> evenly distributed. The reason you need to have a KvCoder is that state is
> partitioned per key (to give natural & automatic parallelism) and window
> (to allow reclaiming expired state so you can process unbounded data with
> bounded storage, and also more parallelism). To a Beam runner, most data in
> the pipeline is "just bytes" that it cannot interpret. KvCoder is a special
> case where a runner knows the binary layout of encoded data so it can pull
> out the keys in order to shuffle data of the same key to the same place, so
> that is why it has to be a KvCoder.
>
> Kenn
>
> On Mon, Feb 12, 2018 at 5:52 AM, Carlos Alonso <[email protected]>
> wrote:
>
>> I was refactoring my solution a bit and tried to make my stateful
>> transform to work on simple case classes and I got this exception:
>> https://pastebin.com/x4xADmvL . I'd like to understand the rationale
>> behind this as I think carefully choosing the keys would be very important
>> in order for the work to be properly distributed.
>>
>> Thanks!
>>
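
For anyone reading this thread later, here is a toy sketch of the idea Kenn describes: a shuffle step that only sees opaque bytes can still route by key when it knows where the key sits in the encoding. This is not Beam's API or its actual wire format. The length-prefixed layout and the names `encode_kv`, `extract_key` and `shuffle` are assumptions made up for illustration, in plain Python with no Beam dependency.

```python
import struct
import zlib
from collections import defaultdict


def encode_kv(key: bytes, value: bytes) -> bytes:
    """Hypothetical layout: 4-byte big-endian key length, key bytes, value bytes."""
    return struct.pack(">I", len(key)) + key + value


def extract_key(encoded: bytes) -> bytes:
    """Read the key back out using only knowledge of the layout above."""
    (key_len,) = struct.unpack_from(">I", encoded, 0)
    return encoded[4:4 + key_len]


def shuffle(encoded_elements, num_workers):
    """Group encoded elements by worker, chosen from a hash of the key bytes.

    The value bytes are never decoded: the "runner" treats them as opaque.
    """
    workers = defaultdict(list)
    for element in encoded_elements:
        key = extract_key(element)
        workers[zlib.crc32(key) % num_workers].append(element)
    return workers


elements = [
    encode_kv(b"user-1", b"a"),
    encode_kv(b"user-2", b"b"),
    encode_kv(b"user-1", b"c"),
]
assignment = shuffle(elements, num_workers=4)
```

Both `user-1` elements end up on the same worker, which is what lets per-key state live in one place. With an arbitrary encoding for a plain case class, the shuffle step would have no way to find a key in the bytes, which matches the reason given above for requiring a KvCoder.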
