Beam is really about parallelizing the processing. Using a single DoFn that does everything is fine as long as the DoFn can process elements in parallel (e.g. upstream source produces lots of elements). Composing multiple DoFns is great for re-use and testing but it isn't strictly necessary. Also, Beam doesn't support back edges in the processing graph so all data flows in one direction and you can't have a cycle. This only allows for map 1 to producie map 2 which then produces map 3 which is then used to update map 1 if all of that logic is within a single DoFn/Transform or you create a cycle using an external system such as write to Kafka topic X and read from Kafka topic X within the same pipeline or update a database downstream from where it is read. There is a lot of ordering complexity and stale data issues whenever using an external store to create a cycle though.
On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan < [email protected]> wrote: > Another way to put this question is, how do we write a beam pipeline for > an existing pipeline (in Java) that has a dozen of custom objects and you > have to work with multiple HashMaps of those custom objects in order to > transform it. Currently, I am writing a beam pipeline by using the same > Custom objects, getters and setters and HashMap<CustomObjects> *but > inside a DoFn*. Is this the optimal way or does Beam offer something else? > > On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan < > [email protected]> wrote: > >> Hi Luke, >> >> We can say Map 2 as a kind of a template using which you want to enrich >> data in Map 1. As I mentioned in my previous post, this is a high level >> scenario. >> >> All these logic are spread across several classes (with ~4K lines of code >> in total). As in any Java application, >> >> 1. The code has been modularized with multiple method calls >> 2. Passing around HashMaps<CustomObject> as argument to each method >> 3. Accessing the attributes of the custom object using getters and >> setters. >> >> This is a common pattern in a normal Java application but I have not seen >> such an example of code in Beam. >> >> >> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote: >> >>> Who reads map 1? >>> Can it be stale? >>> >>> It is unclear what you are trying to do in parallel and why you wouldn't >>> stick all this logic into a single DoFn / stateful DoFn. >>> >>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan < >>> [email protected]> wrote: >>> >>>> Hello Everyone, >>>> >>>> I am in the process of implementing an existing pipeline (written using >>>> Java and Kafka) in Apache Beam. The data from the source stream is >>>> contrived and had to go through several steps of enrichment using REST API >>>> calls and parsing of JSON data. The key >>>> transformation in the existing pipeline is in shown below (a super high >>>> level flow) >>>> >>>> *Method A* >>>> ----Calls *Method B* >>>> ----Creates *Map 1, Map 2* >>>> ----Calls *Method C* >>>> ----Read *Map 2* >>>> ----Create *Map 3* >>>> ----*Method C* >>>> ----Read *Map 3* and >>>> ----update *Map 1* >>>> >>>> The Map we use are multi-level maps and I am thinking of having >>>> PCollections for each Maps and pass them as side inputs in a DoFn wherever >>>> I have transformations that need two or more Maps. But there are certain >>>> tasks which I want to make sure that I am following right approach, for >>>> instance updating one of the side input maps inside a DoFn. >>>> >>>> These are my initial thoughts/questions and I would like to get some >>>> expert advice on how we typically design such an interleaved transformation >>>> in Apache Beam. Appreciate your valuable insights on this. >>>> >>>> -- >>>> Thanks, >>>> Praveen K Viswanathan >>>> >>> >> >> -- >> Thanks, >> Praveen K Viswanathan >> > > > -- > Thanks, > Praveen K Viswanathan >
