Hi,

If the keys bother you, you can .apply(WithKeys.of("")) before the
GroupIntoBatches transform. This effectively removes parallelism, since every
element gets the same key and is funneled through a single worker.
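
Roughly like this (an untested sketch against the Java SDK; the helper name
batchAll, the batch size parameter, and the use of byte[] instead of Byte[]
are just for illustration):

  import org.apache.beam.sdk.transforms.GroupIntoBatches;
  import org.apache.beam.sdk.transforms.Values;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  static PCollection<Iterable<KV<String, byte[]>>> batchAll(
      PCollection<KV<String, byte[]>> input, int batchSize) {
    return input
        // Give every element the same dummy key so they all land in one group.
        .apply(WithKeys.of(""))
        // Batch the single-keyed elements into groups of at most batchSize.
        .apply(GroupIntoBatches.<String, KV<String, byte[]>>ofSize(batchSize))
        // Drop the dummy key, leaving PCollection<Iterable<KV<String, byte[]>>>.
        .apply(Values.create());
  }

In a streaming pipeline you may also want withMaxBufferingDuration(...) (if
your Beam version has it) so a partial batch is not held back indefinitely.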

Note that I think GroupIntoBatches might be broken on Flink [1].

Alternatively, create your own stateful ParDo (it needs a KV input) that
counts elements and emits a batch once it reaches the desired size.
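
A minimal sketch of that approach (untested; it assumes the same dummy-keyed
input as above, uses byte[] for simplicity, and leaves out the timer you would
want in order to flush a partial batch when the window closes, which is
roughly what GroupIntoBatches does for you):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.beam.sdk.coders.ByteArrayCoder;
  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarIntCoder;
  import org.apache.beam.sdk.state.BagState;
  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;

  class BatchFn extends DoFn<KV<String, KV<String, byte[]>>, List<KV<String, byte[]>>> {

    private final int batchSize;

    BatchFn(int batchSize) {
      this.batchSize = batchSize;
    }

    // Elements buffered so far for the current batch.
    @StateId("buffer")
    private final StateSpec<BagState<KV<String, byte[]>>> bufferSpec =
        StateSpecs.bag(KvCoder.of(StringUtf8Coder.of(), ByteArrayCoder.of()));

    // How many elements are currently buffered.
    @StateId("count")
    private final StateSpec<ValueState<Integer>> countSpec =
        StateSpecs.value(VarIntCoder.of());

    @ProcessElement
    public void process(
        @Element KV<String, KV<String, byte[]>> element,
        @StateId("buffer") BagState<KV<String, byte[]>> buffer,
        @StateId("count") ValueState<Integer> count,
        OutputReceiver<List<KV<String, byte[]>>> out) {
      buffer.add(element.getValue());
      Integer stored = count.read();
      int n = (stored == null ? 0 : stored) + 1;
      if (n >= batchSize) {
        // Materialize and emit the batch, then reset the state.
        List<KV<String, byte[]>> batch = new ArrayList<>();
        buffer.read().forEach(batch::add);
        out.output(batch);
        buffer.clear();
        count.clear();
      } else {
        count.write(n);
      }
    }
  }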

Hope it helps,
Cristian

[1] https://lists.apache.org/thread/xo714ntwxgpm199kqrwt9lzn40z882t1

On Wed, Aug 10, 2022 at 3:45 PM Shivam Singhal <[email protected]>
wrote:

> Is there no other way than
> https://stackoverflow.com/a/44956702 ?
>
> On Thu, 11 Aug 2022 at 1:00 AM, Shivam Singhal <
> [email protected]> wrote:
>
>> I have a PCollection of type KV<String, Byte[]> where each key in those
>> KVs is unique.
>>
>> I would like to split all those KV pairs into batches. This new
>> PCollection will be of type PCollection<Iterable<KV<String, Byte[]>>>
>> where the iterable’s length can be configured.
>>
>>
>> I know there is a PTransform called GroupIntoBatches, but it batches based
>> on the keys, which is not my use case.
>>
>> It would be great if someone could help with this.
>>
>> Thanks,
>> Shivam Singhal
>>
>
