Re: Designing an existing pipeline in Beam

Praveen K Viswanathan Wed, 24 Jun 2020 12:57:25 -0700

Thanks Luke, I would like to try the latter approach. Would be able to
share any pseudo-code or point to any example on how to call a common
method inside a DoFn's, let's say, ProcessElement method?


On Tue, Jun 23, 2020 at 6:35 PM Luke Cwik <[email protected]> wrote:

> You can apply the same DoFn / Transform instance multiple times in the
> graph or you can follow regular development practices where the common code
> is factored into a method and two different DoFn's invoke it.
>
> On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan <
> [email protected]> wrote:
>
>> Hi Luke - Thanks for the explanation. The limitation due to directed
>> graph processing and the option of external storage clears most of the
>> questions I had with respect to designing this pipeline. I do have one more
>> scenario to clarify on this thread.
>>
>> If I had a certain piece of logic that I had to use in more than one
>> DoFns how do we do that. In a normal Java application, we can put it as a
>> separate method and call it wherever we want. Is it possible to replicate
>> something like that in Beam's DoFn?
>>
>> On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote:
>>
>>> Beam is really about parallelizing the processing. Using a single DoFn
>>> that does everything is fine as long as the DoFn can process elements in
>>> parallel (e.g. upstream source produces lots of elements). Composing
>>> multiple DoFns is great for re-use and testing but it isn't strictly
>>> necessary. Also, Beam doesn't support back edges in the processing graph so
>>> all data flows in one direction and you can't have a cycle. This only
>>> allows for map 1 to producie map 2 which then produces map 3 which is then
>>> used to update map 1 if all of that logic is within a single DoFn/Transform
>>> or you create a cycle using an external system such as write to Kafka topic
>>> X and read from Kafka topic X within the same pipeline or update a database
>>> downstream from where it is read. There is a lot of ordering complexity and
>>> stale data issues whenever using an external store to create a cycle though.
>>>
>>> On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan <
>>> [email protected]> wrote:
>>>
>>>> Another way to put this question is, how do we write a beam pipeline
>>>> for an existing pipeline (in Java) that has a dozen of custom objects and
>>>> you have to work with multiple HashMaps of those custom objects in order to
>>>> transform it. Currently, I am writing a beam pipeline by using the same
>>>> Custom objects, getters and setters and HashMap<CustomObjects> *but
>>>> inside a DoFn*. Is this the optimal way or does Beam offer something
>>>> else?
>>>>
>>>> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Luke,
>>>>>
>>>>> We can say Map 2 as a kind of a template using which you want to
>>>>> enrich data in Map 1. As I mentioned in my previous post, this is a high
>>>>> level scenario.
>>>>>
>>>>> All these logic are spread across several classes (with ~4K lines of
>>>>> code in total). As in any Java application,
>>>>>
>>>>> 1. The code has been modularized with multiple method calls
>>>>> 2. Passing around HashMaps<CustomObject> as argument to each method
>>>>> 3. Accessing the attributes of the custom object using getters and
>>>>> setters.
>>>>>
>>>>> This is a common pattern in a normal Java application but I have not
>>>>> seen such an example of code in Beam.
>>>>>
>>>>>
>>>>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote:
>>>>>
>>>>>> Who reads map 1?
>>>>>> Can it be stale?
>>>>>>
>>>>>> It is unclear what you are trying to do in parallel and why you
>>>>>> wouldn't stick all this logic into a single DoFn / stateful DoFn.
>>>>>>
>>>>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I am in the process of implementing an existing pipeline (written
>>>>>>> using Java and Kafka) in Apache Beam. The data from the source stream is
>>>>>>> contrived and had to go through several steps of enrichment using REST 
>>>>>>> API
>>>>>>> calls and parsing of JSON data. The key
>>>>>>> transformation in the existing pipeline is in shown below (a super
>>>>>>> high level flow)
>>>>>>>
>>>>>>> *Method A*
>>>>>>> ----Calls *Method B*
>>>>>>>       ----Creates *Map 1, Map 2*
>>>>>>> ----Calls *Method C*
>>>>>>>      ----Read *Map 2*
>>>>>>>      ----Create *Map 3*
>>>>>>> ----*Method C*
>>>>>>>      ----Read *Map 3* and
>>>>>>>      ----update *Map 1*
>>>>>>>
>>>>>>> The Map we use are multi-level maps and I am thinking of having
>>>>>>> PCollections for each Maps and pass them as side inputs in a DoFn 
>>>>>>> wherever
>>>>>>> I have transformations that need two or more Maps. But there are certain
>>>>>>> tasks which I want to make sure that I am following right approach, for
>>>>>>> instance updating one of the side input maps inside a DoFn.
>>>>>>>
>>>>>>> These are my initial thoughts/questions and I would like to get some
>>>>>>> expert advice on how we typically design such an interleaved 
>>>>>>> transformation
>>>>>>> in Apache Beam. Appreciate your valuable insights on this.
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Praveen K Viswanathan
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Praveen K Viswanathan
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Praveen K Viswanathan
>>>>
>>>
>>
>> --
>> Thanks,
>> Praveen K Viswanathan
>>
>

-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Reply via email to