Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Boyuan Zhang
Hi Steve, Yes that's correct. On Fri, May 15, 2020 at 2:11 PM Steve Niemitz wrote: > ah! ok awesome, I think that was the piece I was misunderstanding. So I > _can_ use a SDF to split the work initially (like I was manually doing in > #1), but it just won't be further split dynamically on

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
ah! ok awesome, I think that was the piece I was misunderstanding. So I _can_ use a SDF to split the work initially (like I was manually doing in #1), but it just won't be further split dynamically on dataflow v1 right now. Is my understanding there correct? On Fri, May 15, 2020 at 5:03 PM Luke

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Luke Cwik
#3 is the best when you implement @SplitRestriction on the SDF. The size of each restriction is used to better balance the splits within Dataflow runner v2 so it is less susceptible to the too many or unbalanced split problem. For example, if you have 4 workers and make 20 splits, the splits will

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
Thanks for the replies so far. I should have specifically mentioned above, I am building a bounded source. While I was thinking this through, I realized that I might not actually need any fancy splitting, since I can calculate all my split points up front. I think this goes well with Ismaël's

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Ismaël Mejía
For the Bounded case if you do not have a straight forward way to split at fractions, or simply if you do not care about Dynamic Work Rebalancing. You can get away implementing a simple DoFn (without Restrictions) based implementation and evolve from it. More and more IOs at Beam are becoming DoFn

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Luke Cwik
If it is an unbounded source then SDF is a winner since you are not giving up anything with it when compared to the legacy UnboundedSource API since Dataflow doesn't support dynamic splitting of unbounded SDFs or UnboundedSources (only initial splitting). You gain the ability to compose sources

Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
I'm going to be writing a new IO (in java) for reading files in a custom format, and want to make it splittable. It seems like I have a choice between the "legacy" source API, and newer experimental SDF API. Is there any guidance on which I should use? I can likely tolerate some API churn as