Hi Luke - Thanks for the explanation. The limitation due to directed graph processing and the option of external storage clears most of the questions I had with respect to designing this pipeline. I do have one more scenario to clarify on this thread.
If I had a certain piece of logic that I had to use in more than one DoFns how do we do that. In a normal Java application, we can put it as a separate method and call it wherever we want. Is it possible to replicate something like that in Beam's DoFn? On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote: > Beam is really about parallelizing the processing. Using a single DoFn > that does everything is fine as long as the DoFn can process elements in > parallel (e.g. upstream source produces lots of elements). Composing > multiple DoFns is great for re-use and testing but it isn't strictly > necessary. Also, Beam doesn't support back edges in the processing graph so > all data flows in one direction and you can't have a cycle. This only > allows for map 1 to producie map 2 which then produces map 3 which is then > used to update map 1 if all of that logic is within a single DoFn/Transform > or you create a cycle using an external system such as write to Kafka topic > X and read from Kafka topic X within the same pipeline or update a database > downstream from where it is read. There is a lot of ordering complexity and > stale data issues whenever using an external store to create a cycle though. > > On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan < > [email protected]> wrote: > >> Another way to put this question is, how do we write a beam pipeline for >> an existing pipeline (in Java) that has a dozen of custom objects and you >> have to work with multiple HashMaps of those custom objects in order to >> transform it. Currently, I am writing a beam pipeline by using the same >> Custom objects, getters and setters and HashMap<CustomObjects> *but >> inside a DoFn*. Is this the optimal way or does Beam offer something >> else? >> >> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan < >> [email protected]> wrote: >> >>> Hi Luke, >>> >>> We can say Map 2 as a kind of a template using which you want to enrich >>> data in Map 1. As I mentioned in my previous post, this is a high level >>> scenario. >>> >>> All these logic are spread across several classes (with ~4K lines of >>> code in total). As in any Java application, >>> >>> 1. The code has been modularized with multiple method calls >>> 2. Passing around HashMaps<CustomObject> as argument to each method >>> 3. Accessing the attributes of the custom object using getters and >>> setters. >>> >>> This is a common pattern in a normal Java application but I have not >>> seen such an example of code in Beam. >>> >>> >>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote: >>> >>>> Who reads map 1? >>>> Can it be stale? >>>> >>>> It is unclear what you are trying to do in parallel and why you >>>> wouldn't stick all this logic into a single DoFn / stateful DoFn. >>>> >>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan < >>>> [email protected]> wrote: >>>> >>>>> Hello Everyone, >>>>> >>>>> I am in the process of implementing an existing pipeline (written >>>>> using Java and Kafka) in Apache Beam. The data from the source stream is >>>>> contrived and had to go through several steps of enrichment using REST API >>>>> calls and parsing of JSON data. The key >>>>> transformation in the existing pipeline is in shown below (a super >>>>> high level flow) >>>>> >>>>> *Method A* >>>>> ----Calls *Method B* >>>>> ----Creates *Map 1, Map 2* >>>>> ----Calls *Method C* >>>>> ----Read *Map 2* >>>>> ----Create *Map 3* >>>>> ----*Method C* >>>>> ----Read *Map 3* and >>>>> ----update *Map 1* >>>>> >>>>> The Map we use are multi-level maps and I am thinking of having >>>>> PCollections for each Maps and pass them as side inputs in a DoFn wherever >>>>> I have transformations that need two or more Maps. But there are certain >>>>> tasks which I want to make sure that I am following right approach, for >>>>> instance updating one of the side input maps inside a DoFn. >>>>> >>>>> These are my initial thoughts/questions and I would like to get some >>>>> expert advice on how we typically design such an interleaved >>>>> transformation >>>>> in Apache Beam. Appreciate your valuable insights on this. >>>>> >>>>> -- >>>>> Thanks, >>>>> Praveen K Viswanathan >>>>> >>>> >>> >>> -- >>> Thanks, >>> Praveen K Viswanathan >>> >> >> >> -- >> Thanks, >> Praveen K Viswanathan >> > -- Thanks, Praveen K Viswanathan
