You can apply the same DoFn / Transform instance multiple times in the graph or you can follow regular development practices where the common code is factored into a method and two different DoFn's invoke it.
On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan < [email protected]> wrote: > Hi Luke - Thanks for the explanation. The limitation due to directed graph > processing and the option of external storage clears most of the questions > I had with respect to designing this pipeline. I do have one more scenario > to clarify on this thread. > > If I had a certain piece of logic that I had to use in more than one DoFns > how do we do that. In a normal Java application, we can put it as a > separate method and call it wherever we want. Is it possible to replicate > something like that in Beam's DoFn? > > On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote: > >> Beam is really about parallelizing the processing. Using a single DoFn >> that does everything is fine as long as the DoFn can process elements in >> parallel (e.g. upstream source produces lots of elements). Composing >> multiple DoFns is great for re-use and testing but it isn't strictly >> necessary. Also, Beam doesn't support back edges in the processing graph so >> all data flows in one direction and you can't have a cycle. This only >> allows for map 1 to producie map 2 which then produces map 3 which is then >> used to update map 1 if all of that logic is within a single DoFn/Transform >> or you create a cycle using an external system such as write to Kafka topic >> X and read from Kafka topic X within the same pipeline or update a database >> downstream from where it is read. There is a lot of ordering complexity and >> stale data issues whenever using an external store to create a cycle though. >> >> On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan < >> [email protected]> wrote: >> >>> Another way to put this question is, how do we write a beam pipeline for >>> an existing pipeline (in Java) that has a dozen of custom objects and you >>> have to work with multiple HashMaps of those custom objects in order to >>> transform it. Currently, I am writing a beam pipeline by using the same >>> Custom objects, getters and setters and HashMap<CustomObjects> *but >>> inside a DoFn*. Is this the optimal way or does Beam offer something >>> else? >>> >>> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan < >>> [email protected]> wrote: >>> >>>> Hi Luke, >>>> >>>> We can say Map 2 as a kind of a template using which you want to enrich >>>> data in Map 1. As I mentioned in my previous post, this is a high level >>>> scenario. >>>> >>>> All these logic are spread across several classes (with ~4K lines of >>>> code in total). As in any Java application, >>>> >>>> 1. The code has been modularized with multiple method calls >>>> 2. Passing around HashMaps<CustomObject> as argument to each method >>>> 3. Accessing the attributes of the custom object using getters and >>>> setters. >>>> >>>> This is a common pattern in a normal Java application but I have not >>>> seen such an example of code in Beam. >>>> >>>> >>>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote: >>>> >>>>> Who reads map 1? >>>>> Can it be stale? >>>>> >>>>> It is unclear what you are trying to do in parallel and why you >>>>> wouldn't stick all this logic into a single DoFn / stateful DoFn. >>>>> >>>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan < >>>>> [email protected]> wrote: >>>>> >>>>>> Hello Everyone, >>>>>> >>>>>> I am in the process of implementing an existing pipeline (written >>>>>> using Java and Kafka) in Apache Beam. The data from the source stream is >>>>>> contrived and had to go through several steps of enrichment using REST >>>>>> API >>>>>> calls and parsing of JSON data. The key >>>>>> transformation in the existing pipeline is in shown below (a super >>>>>> high level flow) >>>>>> >>>>>> *Method A* >>>>>> ----Calls *Method B* >>>>>> ----Creates *Map 1, Map 2* >>>>>> ----Calls *Method C* >>>>>> ----Read *Map 2* >>>>>> ----Create *Map 3* >>>>>> ----*Method C* >>>>>> ----Read *Map 3* and >>>>>> ----update *Map 1* >>>>>> >>>>>> The Map we use are multi-level maps and I am thinking of having >>>>>> PCollections for each Maps and pass them as side inputs in a DoFn >>>>>> wherever >>>>>> I have transformations that need two or more Maps. But there are certain >>>>>> tasks which I want to make sure that I am following right approach, for >>>>>> instance updating one of the side input maps inside a DoFn. >>>>>> >>>>>> These are my initial thoughts/questions and I would like to get some >>>>>> expert advice on how we typically design such an interleaved >>>>>> transformation >>>>>> in Apache Beam. Appreciate your valuable insights on this. >>>>>> >>>>>> -- >>>>>> Thanks, >>>>>> Praveen K Viswanathan >>>>>> >>>>> >>>> >>>> -- >>>> Thanks, >>>> Praveen K Viswanathan >>>> >>> >>> >>> -- >>> Thanks, >>> Praveen K Viswanathan >>> >> > > -- > Thanks, > Praveen K Viswanathan >
