Thanks Luke, I would like to try the latter approach. Would be able to share any pseudo-code or point to any example on how to call a common method inside a DoFn's, let's say, ProcessElement method?
On Tue, Jun 23, 2020 at 6:35 PM Luke Cwik <[email protected]> wrote: > You can apply the same DoFn / Transform instance multiple times in the > graph or you can follow regular development practices where the common code > is factored into a method and two different DoFn's invoke it. > > On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan < > [email protected]> wrote: > >> Hi Luke - Thanks for the explanation. The limitation due to directed >> graph processing and the option of external storage clears most of the >> questions I had with respect to designing this pipeline. I do have one more >> scenario to clarify on this thread. >> >> If I had a certain piece of logic that I had to use in more than one >> DoFns how do we do that. In a normal Java application, we can put it as a >> separate method and call it wherever we want. Is it possible to replicate >> something like that in Beam's DoFn? >> >> On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote: >> >>> Beam is really about parallelizing the processing. Using a single DoFn >>> that does everything is fine as long as the DoFn can process elements in >>> parallel (e.g. upstream source produces lots of elements). Composing >>> multiple DoFns is great for re-use and testing but it isn't strictly >>> necessary. Also, Beam doesn't support back edges in the processing graph so >>> all data flows in one direction and you can't have a cycle. This only >>> allows for map 1 to producie map 2 which then produces map 3 which is then >>> used to update map 1 if all of that logic is within a single DoFn/Transform >>> or you create a cycle using an external system such as write to Kafka topic >>> X and read from Kafka topic X within the same pipeline or update a database >>> downstream from where it is read. There is a lot of ordering complexity and >>> stale data issues whenever using an external store to create a cycle though. >>> >>> On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan < >>> [email protected]> wrote: >>> >>>> Another way to put this question is, how do we write a beam pipeline >>>> for an existing pipeline (in Java) that has a dozen of custom objects and >>>> you have to work with multiple HashMaps of those custom objects in order to >>>> transform it. Currently, I am writing a beam pipeline by using the same >>>> Custom objects, getters and setters and HashMap<CustomObjects> *but >>>> inside a DoFn*. Is this the optimal way or does Beam offer something >>>> else? >>>> >>>> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan < >>>> [email protected]> wrote: >>>> >>>>> Hi Luke, >>>>> >>>>> We can say Map 2 as a kind of a template using which you want to >>>>> enrich data in Map 1. As I mentioned in my previous post, this is a high >>>>> level scenario. >>>>> >>>>> All these logic are spread across several classes (with ~4K lines of >>>>> code in total). As in any Java application, >>>>> >>>>> 1. The code has been modularized with multiple method calls >>>>> 2. Passing around HashMaps<CustomObject> as argument to each method >>>>> 3. Accessing the attributes of the custom object using getters and >>>>> setters. >>>>> >>>>> This is a common pattern in a normal Java application but I have not >>>>> seen such an example of code in Beam. >>>>> >>>>> >>>>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote: >>>>> >>>>>> Who reads map 1? >>>>>> Can it be stale? >>>>>> >>>>>> It is unclear what you are trying to do in parallel and why you >>>>>> wouldn't stick all this logic into a single DoFn / stateful DoFn. >>>>>> >>>>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hello Everyone, >>>>>>> >>>>>>> I am in the process of implementing an existing pipeline (written >>>>>>> using Java and Kafka) in Apache Beam. The data from the source stream is >>>>>>> contrived and had to go through several steps of enrichment using REST >>>>>>> API >>>>>>> calls and parsing of JSON data. The key >>>>>>> transformation in the existing pipeline is in shown below (a super >>>>>>> high level flow) >>>>>>> >>>>>>> *Method A* >>>>>>> ----Calls *Method B* >>>>>>> ----Creates *Map 1, Map 2* >>>>>>> ----Calls *Method C* >>>>>>> ----Read *Map 2* >>>>>>> ----Create *Map 3* >>>>>>> ----*Method C* >>>>>>> ----Read *Map 3* and >>>>>>> ----update *Map 1* >>>>>>> >>>>>>> The Map we use are multi-level maps and I am thinking of having >>>>>>> PCollections for each Maps and pass them as side inputs in a DoFn >>>>>>> wherever >>>>>>> I have transformations that need two or more Maps. But there are certain >>>>>>> tasks which I want to make sure that I am following right approach, for >>>>>>> instance updating one of the side input maps inside a DoFn. >>>>>>> >>>>>>> These are my initial thoughts/questions and I would like to get some >>>>>>> expert advice on how we typically design such an interleaved >>>>>>> transformation >>>>>>> in Apache Beam. Appreciate your valuable insights on this. >>>>>>> >>>>>>> -- >>>>>>> Thanks, >>>>>>> Praveen K Viswanathan >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Thanks, >>>>> Praveen K Viswanathan >>>>> >>>> >>>> >>>> -- >>>> Thanks, >>>> Praveen K Viswanathan >>>> >>> >> >> -- >> Thanks, >> Praveen K Viswanathan >> > -- Thanks, Praveen K Viswanathan
