Re: Designing an existing pipeline in Beam

Praveen K Viswanathan Tue, 23 Jun 2020 16:29:02 -0700

Hi Luke - Thanks for the explanation. The limitation due to directed graph
processing and the option of external storage clear up most of the questions
I had with respect to designing this pipeline. I do have one more scenario
to clarify on this thread.


If I have a piece of logic that I need to use in more than one DoFn, how
do I do that? In a normal Java application, we can put it in a separate
method and call it wherever we want. Is it possible to do something
similar with Beam's DoFns?
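
To make the question concrete, here is a minimal sketch of the pattern being asked about: the shared logic sits in a static method of an ordinary helper class, and more than one function class calls it. It is plain Java with no Beam dependency so that it runs on its own. `RecordHelpers`, `CleanKeyFn`, `TagKeyFn`, and the normalization rule are hypothetical names made up for illustration; the two `*Fn` classes only stand in for DoFn subclasses, and `process` stands in for a method annotated with @ProcessElement.

```java
import java.io.Serializable;
import java.util.Locale;

// Hypothetical helper class: the shared logic lives in a static method,
// so it carries no instance state that would need to be serialized.
final class RecordHelpers {
    private RecordHelpers() {}

    // Trim and lower-case a raw key; the rule itself is only illustrative.
    static String normalizeKey(String raw) {
        return raw == null ? "" : raw.trim().toLowerCase(Locale.ROOT);
    }
}

// Stand-in for a first DoFn<String, String>. In Beam, process() would be
// the method annotated with @ProcessElement.
final class CleanKeyFn implements Serializable {
    String process(String element) {
        return RecordHelpers.normalizeKey(element);
    }
}

// Stand-in for a second DoFn that reuses the same helper method.
final class TagKeyFn implements Serializable {
    String process(String element) {
        return "key=" + RecordHelpers.normalizeKey(element);
    }
}

final class SharedLogicDemo {
    public static void main(String[] args) {
        System.out.println(new CleanKeyFn().process("  Order-42 "));
        System.out.println(new TagKeyFn().process(" ORDER-42"));
    }
}
```

The stand-ins implement Serializable because Beam's DoFn is itself Serializable. A static helper avoids adding fields to the DoFn; if the shared logic were held in an instance field instead, that object would presumably need to be serializable as well.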

On Tue, Jun 23, 2020 at 3:47 PM Luke Cwik <[email protected]> wrote:

> Beam is really about parallelizing the processing. Using a single DoFn
> that does everything is fine as long as the DoFn can process elements in
> parallel (e.g. upstream source produces lots of elements). Composing
> multiple DoFns is great for re-use and testing but it isn't strictly
> necessary. Also, Beam doesn't support back edges in the processing graph, so
> all data flows in one direction and you can't have a cycle. This means map 1
> can produce map 2, which then produces map 3, which is then used to update
> map 1, only if all of that logic is within a single DoFn/Transform or you
> create a cycle using an external system, such as writing to Kafka topic X and
> reading from Kafka topic X within the same pipeline, or updating a database
> downstream from where it is read. There is a lot of ordering complexity and
> stale data issues whenever you use an external store to create a cycle, though.
>
> On Mon, Jun 22, 2020 at 6:02 PM Praveen K Viswanathan <
> [email protected]> wrote:
>
>> Another way to put this question is: how do we write a Beam pipeline for
>> an existing pipeline (in Java) that has a dozen custom objects and has
>> to work with multiple HashMaps of those custom objects in order to
>> transform the data? Currently, I am writing a Beam pipeline using the same
>> custom objects, getters and setters, and HashMap<CustomObjects> *but
>> inside a DoFn*. Is this the optimal way, or does Beam offer something
>> else?
>>
>> On Mon, Jun 22, 2020 at 3:47 PM Praveen K Viswanathan <
>> [email protected]> wrote:
>>
>>> Hi Luke,
>>>
>>> We can think of Map 2 as a kind of template that is used to enrich the
>>> data in Map 1. As I mentioned in my previous post, this is a high-level
>>> scenario.
>>>
>>> All of this logic is spread across several classes (with ~4K lines of
>>> code in total). As in any Java application:
>>>
>>> 1. The code has been modularized with multiple method calls.
>>> 2. HashMaps<CustomObject> are passed around as arguments to each method.
>>> 3. The attributes of the custom objects are accessed using getters and
>>> setters.
>>>
>>> This is a common pattern in a normal Java application but I have not
>>> seen such an example of code in Beam.
>>>
>>>
>>> On Mon, Jun 22, 2020 at 8:23 AM Luke Cwik <[email protected]> wrote:
>>>
>>>> Who reads map 1?
>>>> Can it be stale?
>>>>
>>>> It is unclear what you are trying to do in parallel and why you
>>>> wouldn't stick all this logic into a single DoFn / stateful DoFn.
>>>>
>>>> On Sat, Jun 20, 2020 at 7:14 PM Praveen K Viswanathan <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> I am in the process of implementing an existing pipeline (written
>>>>> using Java and Kafka) in Apache Beam. The data from the source stream is
>>>>> contrived and has to go through several steps of enrichment using REST API
>>>>> calls and parsing of JSON data. The key
>>>>> transformation in the existing pipeline is shown below (a super
>>>>> high-level flow).
>>>>>
>>>>> *Method A*
>>>>> ----Calls *Method B*
>>>>>      ----Creates *Map 1, Map 2*
>>>>> ----Calls *Method C*
>>>>>      ----Read *Map 2*
>>>>>      ----Create *Map 3*
>>>>> ----*Method C*
>>>>>      ----Read *Map 3* and
>>>>>      ----update *Map 1*
>>>>>
>>>>> The maps we use are multi-level maps, and I am thinking of having a
>>>>> PCollection for each map and passing them as side inputs to a DoFn wherever
>>>>> I have transformations that need two or more maps. But there are certain
>>>>> tasks where I want to make sure that I am following the right approach, for
>>>>> instance updating one of the side input maps inside a DoFn.
>>>>>
>>>>> These are my initial thoughts/questions, and I would like to get some
>>>>> expert advice on how we typically design such an interleaved
>>>>> transformation in Apache Beam. I appreciate your valuable insights on this.
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Praveen K Viswanathan
>>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Praveen K Viswanathan
>>>
>>
>>
>> --
>> Thanks,
>> Praveen K Viswanathan
>>
>

-- 
Thanks,
Praveen K Viswanathan
