Hi Steve,
Yes that's correct.
On Fri, May 15, 2020 at 2:11 PM Steve Niemitz wrote:
> ah! ok awesome, I think that was the piece I was misunderstanding. So I
> _can_ use a SDF to split the work initially (like I was manually doing in
> #1), but it just won't be further split dynamically on
ah! ok awesome, I think that was the piece I was misunderstanding. So I
_can_ use a SDF to split the work initially (like I was manually doing in
#1), but it just won't be further split dynamically on dataflow v1 right
now. Is my understanding there correct?
On Fri, May 15, 2020 at 5:03 PM Luke
#3 is the best when you implement @SplitRestriction on the SDF.
The size of each restriction is used to better balance the splits within
Dataflow runner v2 so it is less susceptible to the too many or unbalanced
split problem.
For example, if you have 4 workers and make 20 splits, the splits will
Thanks for the replies so far. I should have specifically mentioned above,
I am building a bounded source.
While I was thinking this through, I realized that I might not actually
need any fancy splitting, since I can calculate all my split points up
front. I think this goes well with Ismaël's
For the Bounded case if you do not have a straight forward way to split at
fractions, or simply if you do not care about Dynamic Work Rebalancing. You can
get away implementing a simple DoFn (without Restrictions) based implementation
and evolve from it. More and more IOs at Beam are becoming DoFn
If it is an unbounded source then SDF is a winner since you are not giving
up anything with it when compared to the legacy UnboundedSource API since
Dataflow doesn't support dynamic splitting of unbounded SDFs or
UnboundedSources (only initial splitting). You gain the ability to compose
sources
I'm going to be writing a new IO (in java) for reading files in a custom
format, and want to make it splittable. It seems like I have a choice
between the "legacy" source API, and newer experimental SDF API. Is there
any guidance on which I should use? I can likely tolerate some API churn
as