Re: Designing an existing pipeline in Beam

Luke Cwik Tue, 23 Jun 2020 18:35:31 -0700

You can apply the same DoFn / Transform instance multiple times in the
graph or you can follow regular development practices where the common code
is factored into a method and two different DoFn's invoke it.


On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan <
[email protected]> wrote:

> Hi Luke - Thanks for the explanation. The limitation due to directed graph
> processing and the option of external storage clears most of the questions
> I had with respect to designing this pipeline. I do have one more scenario
> to clarify on this thread.
>
> If I had a certain piece of logic that I had to use in more than one DoFns
> how do we do that. In a normal Java application, we can put it as a
> separate method and call it wherever we want. Is it possible to replicate
> something like that in Beam's DoFn?
>
> On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote:
>
>> Beam is really about parallelizing the processing. Using a single DoFn
>> that does everything is fine as long as the DoFn can process elements in
>> parallel (e.g. upstream source produces lots of elements). Composing
>> multiple DoFns is great for re-use and testing but it isn't strictly
>> necessary. Also, Beam doesn't support back edges in the processing graph so
>> all data flows in one direction and you can't have a cycle. This only
>> allows for map 1 to producie map 2 which then produces map 3 which is then
>> used to update map 1 if all of that logic is within a single DoFn/Transform
>> or you create a cycle using an external system such as write to Kafka topic
>> X and read from Kafka topic X within the same pipeline or update a database
>> downstream from where it is read. There is a lot of ordering complexity and
>> stale data issues whenever using an external store to create a cycle though.
>>
>> On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan <
>> [email protected]> wrote:
>>
>>> Another way to put this question is, how do we write a beam pipeline for
>>> an existing pipeline (in Java) that has a dozen of custom objects and you
>>> have to work with multiple HashMaps of those custom objects in order to
>>> transform it. Currently, I am writing a beam pipeline by using the same
>>> Custom objects, getters and setters and HashMap<CustomObjects> *but
>>> inside a DoFn*. Is this the optimal way or does Beam offer something
>>> else?
>>>
>>> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan <
>>> [email protected]> wrote:
>>>
>>>> Hi Luke,
>>>>
>>>> We can say Map 2 as a kind of a template using which you want to enrich
>>>> data in Map 1. As I mentioned in my previous post, this is a high level
>>>> scenario.
>>>>
>>>> All these logic are spread across several classes (with ~4K lines of
>>>> code in total). As in any Java application,
>>>>
>>>> 1. The code has been modularized with multiple method calls
>>>> 2. Passing around HashMaps<CustomObject> as argument to each method
>>>> 3. Accessing the attributes of the custom object using getters and
>>>> setters.
>>>>
>>>> This is a common pattern in a normal Java application but I have not
>>>> seen such an example of code in Beam.
>>>>
>>>>
>>>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote:
>>>>
>>>>> Who reads map 1?
>>>>> Can it be stale?
>>>>>
>>>>> It is unclear what you are trying to do in parallel and why you
>>>>> wouldn't stick all this logic into a single DoFn / stateful DoFn.
>>>>>
>>>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I am in the process of implementing an existing pipeline (written
>>>>>> using Java and Kafka) in Apache Beam. The data from the source stream is
>>>>>> contrived and had to go through several steps of enrichment using REST 
>>>>>> API
>>>>>> calls and parsing of JSON data. The key
>>>>>> transformation in the existing pipeline is in shown below (a super
>>>>>> high level flow)
>>>>>>
>>>>>> *Method A*
>>>>>> ----Calls *Method B*
>>>>>>       ----Creates *Map 1, Map 2*
>>>>>> ----Calls *Method C*
>>>>>>      ----Read *Map 2*
>>>>>>      ----Create *Map 3*
>>>>>> ----*Method C*
>>>>>>      ----Read *Map 3* and
>>>>>>      ----update *Map 1*
>>>>>>
>>>>>> The Map we use are multi-level maps and I am thinking of having
>>>>>> PCollections for each Maps and pass them as side inputs in a DoFn 
>>>>>> wherever
>>>>>> I have transformations that need two or more Maps. But there are certain
>>>>>> tasks which I want to make sure that I am following right approach, for
>>>>>> instance updating one of the side input maps inside a DoFn.
>>>>>>
>>>>>> These are my initial thoughts/questions and I would like to get some
>>>>>> expert advice on how we typically design such an interleaved 
>>>>>> transformation
>>>>>> in Apache Beam. Appreciate your valuable insights on this.
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Praveen K Viswanathan
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Praveen K Viswanathan
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Praveen K Viswanathan
>>>
>>
>
> --
> Thanks,
> Praveen K Viswanathan
>

Re: Designing an existing pipeline in Beam

Reply via email to