You are quite correct that choosing the keys is important for work to be
evenly distributed. The reason you need to have a KvCoder is that state is
partitioned per key (to give natural & automatic parallelism) and window
(to allow reclaiming expired state so you can process unbounded data with
bounded storage, and also more parallelism). To a Beam runner, most data in
the pipeline is "just bytes" that it cannot interpret. KvCoder is a special
case where a runner knows the binary layout of encoded data so it can pull
out the keys in order to shuffle data of the same key to the same place, so
that is why it has to be a KvCoder.
On Mon, Feb 12, 2018 at 5:52 AM, Carlos Alonso <car...@mrcalonso.com> wrote:
> I was refactoring my solution a bit and tried to make my stateful
> transform to work on simple case classes and I got this exception:
> https://pastebin.com/x4xADmvL . I'd like to understand the rationale
> behind this as I think carefully choosing the keys would be very important
> in order for the work to be properly distributed.