If I am connecting the threads properly, you are trying to simulate
triggering based on the size of buffered data, while you are using the type
of message to route messages to one GCS location or another. Yes?

On Tue, Feb 13, 2018 at 10:29 AM, Carlos Alonso <car...@mrcalonso.com>
wrote:

> I have a question about this. My scenario is:
>
> * PubSub input with a timestampAttribute named doc_timestamp
> * Fixed windowing of one hour size.
> * Keys are an internal attribute of the messages (the type)
> * Messages of one particular type are way more frequent than the others,
> so it is likely a hot key
>
> Will it help if I add a string representation of the doc_timestamp (to the
> hour) to the key, in order to increase the range of keys and therefore make
> it more parallelisable?
>

It might help, or it might not; I think it depends on how bursty the data
is. Technically, the runner could already parallelize over the windows
whenever multiple windows are active for a key. No runner actually does
this; they all parallelize only over the key. So embedding the window in the
key does make that parallelism explicit. But with non-overlapping windows and
data streaming in near real time, if the processing of consecutive windows
overlaps a lot, you are probably not keeping up anyway.
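
To make the "window in the key" idea concrete, here is a minimal sketch in
plain Python rather than Beam code. It assumes doc_timestamp arrives as
milliseconds since the epoch (it may equally be an RFC 3339 string in your
setup), and the name hourly_key and the "|" separator are made up for
illustration.

```python
from datetime import datetime, timezone


def hourly_key(msg_type, doc_timestamp_ms):
    """Build a composite key of message type plus the hour of doc_timestamp.

    Assumption: doc_timestamp_ms is milliseconds since the Unix epoch (UTC).
    """
    hour = datetime.fromtimestamp(doc_timestamp_ms / 1000, tz=timezone.utc)
    return f"{msg_type}|{hour.strftime('%Y-%m-%dT%H')}"


# Two messages of the same type in the same hour share a key; a message in
# the next hour gets a different key, so the two hours can be processed in
# parallel even by a runner that only parallelizes over keys.
same_hour_a = hourly_key("click", 1518517740000)
same_hour_b = hourly_key("click", 1518518000000)
next_hour = hourly_key("click", 1518519900000)
```

Note that this only adds keys across hours. Within a single hour the hot
type still maps to exactly one key, which is why it does not help much when
windows are processed one after another.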

I think your best choice is probably following the example of
Combine.withHotKeyFanout to add sub-sharding to your hot key(s) and then
working out a way to write them out to appropriate locations.
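
As a rough illustration of the sub-sharding idea, here is a sketch in plain
Python. It is not the Combine.withHotKeyFanout implementation, only the same
two-stage shape: aggregate per (key, shard), then merge the shards. The
fanout of 4, the byte-sum shard choice and all function names are
assumptions made up for this example.

```python
from collections import defaultdict

HOT_KEY_FANOUT = 4  # hypothetical number of sub-shards per hot key


def shard_key(key, payload, fanout=HOT_KEY_FANOUT):
    """Spread one logical key over `fanout` sub-keys.

    The shard is derived from the payload bytes so the choice is
    deterministic; any reasonably uniform choice would do.
    """
    return (key, sum(payload.encode("utf-8")) % fanout)


def partial_sizes(messages, hot_keys):
    """Stage 1: total payload size per (key, shard).

    Only hot keys are split; every other key stays on shard 0.
    """
    partial = defaultdict(int)
    for key, payload in messages:
        sub_key = shard_key(key, payload) if key in hot_keys else (key, 0)
        partial[sub_key] += len(payload)
    return dict(partial)


def merge_shards(partial):
    """Stage 2: drop the shard component and combine the partial results."""
    total = defaultdict(int)
    for (key, _shard), size in partial.items():
        total[key] += size
    return dict(total)


messages = [("click", "a"), ("click", "b"), ("click", "c"), ("view", "zz")]
partial = partial_sizes(messages, hot_keys={"click"})
totals = merge_shards(partial)
```

For the write-to-GCS case, stage 2 may not be needed at all: each
(type, shard) pair could write its own file under the location for that
type, so the hot type simply produces several files per window.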

Kenn


> I wonder if it will help or not, as the final result will be the same
> (type + window), but not sure if it would help before the point where the
> windowing is applied.
>
> Thanks!
>
>
> On Mon, Feb 12, 2018 at 6:41 PM Carlos Alonso <car...@mrcalonso.com>
> wrote:
>
>> Ok, that makes a lot of sense. Thanks Kenneth!
>>
>> On Mon, Feb 12, 2018 at 5:41 PM Kenneth Knowles <k...@google.com> wrote:
>>
>>> Hi Carlos,
>>>
>>> You are quite correct that choosing the keys is important for work to be
>>> evenly distributed. The reason you need to have a KvCoder is that state is
>>> partitioned per key (to give natural & automatic parallelism) and window
>>> (to allow reclaiming expired state so you can process unbounded data with
>>> bounded storage, and also more parallelism). To a Beam runner, most data in
>>> the pipeline is "just bytes" that it cannot interpret. KvCoder is a special
>>> case where a runner knows the binary layout of encoded data so it can pull
>>> out the keys in order to shuffle data of the same key to the same place, so
>>> that is why it has to be a KvCoder.
>>>
>>> Kenn
>>>
>>> On Mon, Feb 12, 2018 at 5:52 AM, Carlos Alonso <car...@mrcalonso.com>
>>> wrote:
>>>
>>>> I was refactoring my solution a bit and tried to make my stateful
>>>> transform work on simple case classes, and I got this exception:
>>>> https://pastebin.com/x4xADmvL . I'd like to understand the rationale
>>>> behind this as I think carefully choosing the keys would be very important
>>>> in order for the work to be properly distributed.
>>>>
>>>> Thanks!
>>>>
>>>
>>>
