I would like to see an example.

From: Joey Tran <joey.t...@schrodinger.com>
Sent: Tuesday, October 15, 2024 11:09 AM
To: user@beam.apache.org
Subject: Re: Transform Pattern Question

Having thought about it over the past few days, I've concluded that shared 
transforms should generally also expose their DoFn classes to make this kind of 
pattern easier to accommodate. Then, with a utility decorator/class that takes 
a DoFn, we can modify the wrapped DoFn to operate on `KV`s and leave the keys 
alone.

It's not the most ergonomic pattern imo since it requires more consideration of 
PTransforms vs DoFns and which abstraction level is right for your needs, and 
also knowing about this `Keyed[DoFn]` decorator, but it seems unavoidable.
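
A rough sketch of what such a `Keyed` wrapper might look like, in plain Python so it runs without Beam installed. `SquareRootDoFn` and `Keyed` are hypothetical names for illustration only; in a real pipeline both would presumably subclass `apache_beam.DoFn`, and the wrapper would likely also need to forward lifecycle methods such as setup and teardown.

```python
import math


class SquareRootDoFn:
    """Hypothetical stand-in for the other team's DoFn (not a real Beam class)."""

    def process(self, element):
        yield math.sqrt(element)


class Keyed:
    """Wraps a DoFn-like object so it operates on (key, value) pairs.

    The key is passed through untouched; only the value reaches the wrapped
    process() method, and every output is re-paired with the original key.
    """

    def __init__(self, dofn):
        self._dofn = dofn

    def process(self, element):
        key, value = element
        for output in self._dofn.process(value):
            yield key, output


records = [("id-1", 4.0), ("id-2", 9.0)]
keyed_sqrt = Keyed(SquareRootDoFn())
results = [out for record in records for out in keyed_sqrt.process(record)]
print(results)  # [('id-1', 2.0), ('id-2', 3.0)]
```

To keep the original property next to the result, the key could be the whole `(ID, foo_prop)` pair rather than the ID alone.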

On Sat, Oct 12, 2024 at 4:38 PM Henry Tremblay <paulhtremb...@gmail.com> wrote:
We have a similar question/issue at my work. 2 solutions come to mind:

1. Wrap your inputs, transforms, etc. in functions that you can call and then 
chain together

2. Use external libraries that a ParDo class can call. Then you can make these 
external libraries flexible and testable.
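
As an illustration of option 2, here is a minimal sketch in plain Python (no Beam dependency). `square_root` stands in for a function from a shared library and `ExtendRecordDoFn` for the class a ParDo would run; both names are made up for this example.

```python
import math


def square_root(value):
    """Hypothetical library function: plain Python, no pipeline types."""
    return math.sqrt(value)


class ExtendRecordDoFn:
    """Hypothetical DoFn-like class that keeps the ID and calls the library."""

    def process(self, record):
        record_id, foo = record
        yield record_id, foo, square_root(foo)


extended = [
    out for rec in [("id-1", 16.0)] for out in ExtendRecordDoFn().process(rec)
]
print(extended)  # [('id-1', 16.0, 4.0)]
```

Because `square_root` knows nothing about pipelines, it can be unit tested on its own, and each caller decides how to carry IDs or other fields alongside it.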

On Sat, Oct 12, 2024, 12:31 PM Joey Tran <joey.t...@schrodinger.com> wrote:
Yes. But this is a hypothetical; there could also be many operations you might 
want to do with the initial data.

On Sat, Oct 12, 2024, 1:47 PM Henry Tremblay <paulhtremb...@gmail.com> wrote:
So the only part of the pipeline you need to change is the transformation in 
the middle, after the read from the DB and before some type of write?

On Sat, Oct 12, 2024 at 3:29 AM <trs...@gmail.com> wrote:
Sounds like you want a monad, heh.

It would be nice if their DoFn took a generic type and you could pass it a 
selector func to pick out what they need.
If you can access their DoFn and it is not too complex, perhaps you could just 
use their processElement implementation directly?

eg

class TheirDoFn ..{ void processElement(...){...} }

class YourDoFn .. {
  void processElement() {
    TheirDoFn().processElement(...)
  }
}

Depending on what annotations they're using in their processElement func, this 
may or may not be tricky. You could pass in a mock OutputReceiver 
implementation, so you can wrap the results and delegate.
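
The selector-func idea from the start of this message could look roughly like the sketch below, written in plain Python with no Beam dependency. The class and its `selector` and `combine` parameters are hypothetical; `combine` is an extra assumption, added so the caller can decide how the result is attached to the original element.

```python
import math


class SquareRootDoFn:
    """Hypothetical generic DoFn: the caller says where the input lives
    (selector) and how to attach the result to the element (combine)."""

    def __init__(self, selector=lambda x: x, combine=lambda element, root: root):
        self._selector = selector
        self._combine = combine

    def process(self, element):
        root = math.sqrt(self._selector(element))
        yield self._combine(element, root)


# Default behaviour: bare floats in, bare floats out.
plain = list(SquareRootDoFn().process(9.0))

# Record-aware behaviour: pick the float out of (id, foo), append the root.
with_id = list(
    SquareRootDoFn(
        selector=lambda rec: rec[1],
        combine=lambda rec, root: rec + (root,),
    ).process(("id-1", 9.0))
)
print(plain)    # [3.0]
print(with_id)  # [('id-1', 9.0, 3.0)]
```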

On Sat, 12 Oct 2024 at 08:51, XQ Hu via user <user@beam.apache.org> wrote:
This sounds like what CDC (Change Data Capture) typically does, which usually 
runs as a streaming pipeline.

On Fri, Oct 11, 2024 at 3:51 PM Joey Tran <joey.t...@schrodinger.com> wrote:
Another basic pattern question for the user group.

Say I have a database of records with an ID and some float property. Another 
team has written and published a transform `SquareRoot`. I want to write a 
pipeline that reads this database and outputs extended records that have (ID, 
foo_prop, squareroot(foo)_prop). How do I do this?

Of course I can strip my records of their ID and then pass in the properties 
straight into `SquareRoot`, but then I have no way to link it back to what 
record the square root corresponds to. Do I just need to ask the other team to 
make their SquareRootDoFn public? Should they have included a 
`SquareRoot.WithKey()` transform that ignores a key?

This feels like it'd be a common pattern, but how to approach it feels awkward. 
I'm not sure if I'm missing something obvious, so I thought I'd ask the group.

Cheers,
Joey

--

Joey Tran | Staff Developer | AutoDesigner TL

he/him

[Schrödinger, Inc.]<https://schrodinger.com/>


--
Henry Tremblay
Data Engineer, Best Buy
