Side note: The Python classes I linked implement SDFs, which still don't
work on Flink, hence they still need a shuffle despite being splittable.
Would be interesting to see if it works on Dataflow (just the splitting,
without the explicit shuffle), but I cannot test it myself. So if anyone
wants to give it a try, I'd be interested in the results.
Janek
On 22/03/2022 16:19, Jack McCluskey wrote:
My gut instinct is that Cham's reshuffle suggestion is going to be the
better solution in this case since the input bundle is rather small,
but yes an SDF has some utility here if you wanted to go the extra
mile and let the runner split up the bundle. I don't have enough SDF
experience to say whether or not that extra effort would parallelize
the reads on the runner side though.